A bigger picture



This thesis forms the basis of an overarching project: we wish to experiment with
language design and optimizations. We seek a language with flexibility beyond
Lisp and with good syntax. Our base language is Ruby, which is often described
as Lisp with syntax. We strive for high performance. One of our goals is to
make dynamic optimizations and optimizations at the data structure
level possible.

In this work we solve one aspect of this project. A major part of every compiler
implementation consists of various forms of pattern matching, often written in an
ad-hoc way. For example, tokenization, parsing, expression simplification, dataflow
analysis and instruction selection can all be seen as instances of pattern matching. We
introduce amethyst, a tool for pattern matching of arbitrary data.

To reach our goals of high efficiency and flexibility we use top-down parsers.
For a long time bottom-up parsers were viewed as the only way to handle
a reasonably rich class of languages. However, top-down parsers have received a lot
of attention recently.

The new formalism of boolean grammars [28] extends context-free grammars
(introduced by Chomsky in 1956 [S]) to a wider family of languages that includes
most programming languages. A variant of the top-down parser that achieves
linear time by memoization was introduced by Ford [12]. These parsers can be
generated from a description in PEG format. We extend this research by
introducing the notion of structured grammars, which overcomes several limitations of PEG. We
provide a linear time algorithm for parsers of structured grammars that gives
exactly the same output as a backtracking top-down parser. The class of languages
recognized by PEG is equivalent to the class REG^^*^ recognized by amethyst.

Amethyst takes inspiration from OMeta (2007) [42], which extended parsing
expression grammars (2002), which extended regular expressions (1956) [19], which
were introduced as a way to describe finite state machines (1943) [22].

One of the extensions made in OMeta is pattern matching of tree-like data
structures. We further extend this work in several respects. One is extending pattern
matching to arbitrary data structures. OMeta contains hints of a functional
language. Amethyst provides several high-level constructs known from
functional languages (lambdas, trackable state). A goal is to make grammars written in
amethyst more maintainable.

We introduce a framework to make parsing dynamic (Chapter [3]), probably the
first time this has been done. An editor can add or remove characters and obtain
updates to a syntax tree. A dynamic parser recomputes only the rules it needs
to recompute. For typical workloads a change takes only O(log n) time. One
application will be to make syntax highlighting and other tools easier to write
and more accurate.

As a step in experimenting with language design we created a simple dynamic 
programming language called Peridot. 
1. Amethyst language



Amethyst is a pattern matching system. 

The purpose of this chapter is to introduce the amethyst language and to teach how to
use it effectively. We describe amethyst as an evolution of concepts from various
pattern matching systems. We will progressively see more hints of a functional
programming style. In fact, amethyst turned out to be a full functional language.

Our starting point is regular expressions, and we will visit several different
settings, each more general than the previous one.

Then we move to the problem of parsing, with a focus on top-down parsing.
We introduce the PEG parser and generalize it to the more flexible REG^^*^ parser. (In
fact, REG^^*^ stands for relativized regular expressions.) Next we move
to pattern matching of tree-like structures. Finally we generalize our pattern
matching to objects that can form arbitrary graphs.

1.1 Notation 

For better readability our examples use syntax highlighting. We also use the 
following notational conventions: 

Example code is enclosed in a box like this.

In examples the result of an expression is written with the following syntax:
2+2 #-> 4

Most of the amethyst functionality is implemented by normal amethyst code
in the standard prologue file. We also show portions of the standard prologue in boxes
with color like this.



1.2 Technical prerequisites 

We assume that the reader knows the basic syntax of the Ruby language (we give
a brief overview in the next section) and is familiar with the basics of formal
language theory. From Section 1.7 onward we expect an understanding of basic
functional programming techniques.






1.2.1 Basics of Ruby 



We show several examples of Ruby expressions that we will use in later sections. 

Arithmetic offers no surprise: 

# This is a comment.
# Expected results of expressions are denoted by
# comment #-> result
x=2
y=3
x+y*y #-> 11

Function definition and function call are written as follows: 

def pyth(x,y)
  x*x+y*y
end

pyth(3,4) #-> 25

We use the following operations with arrays: 

[1,2] + [3,4] #-> [1,2,3,4]
x = [1,2]

# We use the splat operator to expand arrays
[x, *x] #-> [[1,2],1,2]

# Splat can be used in function calls
pyth(*x) #-> 5

A closure is an important concept from functional programming [39]. Both Ruby
and amethyst use closures. An example in Ruby follows:

def foo(x,y)
  proc{
    x=x+1
    x+y
  }
end

x=1
z=foo(x,2)
z.call #-> 4
x=0
z.call #-> 5



1.2.2 Getting sources 

The source of amethyst can be obtained from its git repository by the following
command:

git clone git://github.com/neleai/mthyst.git

This thesis refers to a version of amethyst that can be obtained by running the
following command:

git checkout thesis

Examples used in this thesis are also in the doc directory of amethyst.

The peridot language can be obtained by:

git clone git://github.com/neleai/peridot.git

Installation and running instructions are in the README files.
1.2.3 Using amethyst



To use amethyst in a Ruby program you need to load it first:
require 'amethyst'

Then you can load amethyst source files as in the following example:
Amethyst::file 'example1.ame'

An amethyst source file is a Ruby source file except for grammar definitions with
the following syntax:

amethyst Grammar {
  rules
}

The "hello world" program of parser generators is a simple calculator. We follow this
tradition too. The constructions used will be the topic of the following chapters.

The source consists of the amethyst source file calculator.ame:

amethyst Calculator {
  calculate = add_expr

  add_expr = add_expr:x "+" mul_expr:y -> x+y
           | add_expr:x "-" mul_expr:y -> x-y
           | mul_expr

  mul_expr = mul_expr:x "*" atom_expr:y -> x*y
           | mul_expr:x "/" atom_expr:y -> x/y
           | atom_expr

  atom_expr = "(" add_expr:x ")" -> x
            | float
}

puts Calculator.calculate("2-4+2*2~2") #-> 4

and the Ruby source file calculator.rb:

require 'amethyst'
Amethyst::file 'calculator.ame'
while true
  input = gets
  puts Calculator.calculate(input)
end

The file calculator.rb is run by the command:
ruby calculator.rb

For tasks where a simple expression suffices, defining a full grammar is not
necessary. We can enclose an arbitrary amethyst expression e in the following
construction: (| e |). This creates an object that can be handled in a similar way as a
regular expression. So instead of writing:

amethyst Hello_World {
  hello = 'hello'
  world = 'world'
  hello_world = hello ' ' world
}

Hello_World.hello_world(s)
one can write:
(| 'hello world' |) === s
1.3 Regular expressions



Regular expressions provide a way to match strings of text and are widely
supported by many languages and libraries. They extend the search and replace functionality
of text editors like vi. Implementations typically add nonstandard extensions,
which we will not consider in this work.

Regular expressions are formed recursively: an expression is either an atomic
expression that cannot be decomposed (and typically matches a single character)
or an expression composed from smaller expressions by some operator.



Atomic expressions

  Syntax     Description
  c          Match character c (see footnote).
  .          Match arbitrary character.
  [group]    Match character described in character group.

Operators

  Syntax     Description
  e1e2       Sequencing
  e1|e2      Choice
  (e)        Grouping
  e*         Iteration: match e 0 or more times.
  e+         Iteration: match e 1 or more times.
  e?         Iteration: match e 0 or 1 times.



For example, the expression [Hh]ello (world|worlds) matches the strings
"Hello world", "hello world", "Hello worlds", "hello worlds".

In Ruby regular expressions are enclosed by "/". We match the example above
against the string "hello world" by writing:

/[Hh]ello (world|worlds)/ === "hello world"

Note that the space is also matched literally. This becomes problematic for more
complex expressions, as they cannot be reformatted.
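The claims above can be checked directly in Ruby. The following is a minimal sketch that uses only the core Regexp class; the constant and variable names are illustrative. Ruby regular expressions are unanchored by default, so we add \A and \z here (an assumption of this sketch, not part of the pattern in the text) to test whole strings.

```ruby
# Pattern from the text, anchored so that the whole string must match.
PATTERN = /\A[Hh]ello (world|worlds)\z/

candidates = ["Hello world", "hello world", "Hello worlds", "hello worlds"]
results = candidates.map { |s| PATTERN === s }

# The single space in the pattern is matched literally,
# so an input with two spaces is rejected.
double_space = PATTERN === "hello  world"

puts results.inspect
puts double_space.inspect
```

All four candidate strings are accepted, while the input with two spaces is rejected. This is the reformatting problem that a whitespace insensitive syntax avoids.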

1.4 Amethyst grammars and expressions 

The syntax of a regular expression and its equivalent amethyst expression is 
similar. 

We embed amethyst expressions with the (| e |) syntax. The example from the
previous section becomes:
(| <Hh> 'ello ' ('world' | 'worlds') |) === "hello world"

Footnote: Unless c has a special meaning, in which case you have to escape it.

Amethyst is whitespace insensitive. We need to enclose matched strings with
single quotes. The reason why [] turns to <> will be explained in Section 1.10.



Grammars 



Expressions are useful for simple tasks. More complicated tasks are described by
grammars. An amethyst source file consists of grammars that contain rules. The syntax
of rule definitions and calls is the following:

Pattern     Description
name = e    Rule definition
name        Rule call



If we want the hello world program to be whitespace insensitive we can write it as:

amethyst Grammar {
  space = < \t\r\n>
  hello = 'hello' space+ 'world'
}

As a less trivial example we show rules recognizing integers:

amethyst Grammar2 {
  digit = <0-9>
  int = '-'? digit+
}

# A rule can be invoked in the following way:
Grammar2.int("421") #-> ['4','2','1']



Character groups 

Character groups provide a concise way to match a single character from a given set
of characters.

The following constructions are supported:

Regular expression   Amethyst     Description
[a]                  <a>          Match character "a"
[aeiou]              <aeiou>      Match any of characters "aeiou"
[a-z]                <a-z>        Match any of characters from "a" to "z"
[^abc]               <^abc>       Match any character except "abc"
[[:digit:]]          <<digit>>    Match predefined class



In the definitions above the characters "<>\" have to be escaped.

There are several predefined rules to match POSIX character classes (alpha,
alnum, digit, ...). A user can also define a custom character class, say vowels:

vowels = <aeiou>

and use it as a character group class: <<vowels>0-9>.

Footnote: If we do not care what this rule returns.
1.5 Amethyst expressions



Amethyst consists of a small set of core operators. The rest of the amethyst syntax is
syntactic sugar that is converted to core operators. Most of the amethyst functionality
is provided by ordinary rules. These rules are contained in a file called the standard prologue,
listed in Appendix [B]. We will show relevant parts of the standard prologue as examples.

Explaining the exact semantics of the core operators will take some time. In this section
we only briefly summarize the core operator syntax. Various aspects of the core operators
will be covered later.



Basic operators 



Like most pattern matching systems, amethyst supports the following operators:

Name          Description
e1 e2         Sequencing
e1|e2         Prioritized choice
(e)           Grouping
e*, e+, e?    Iteration



Lookaheads 

When parsing programming languages, the decision which alternative should be
used often depends on the future input. This is done by means of lookaheads. Due
to the limited memory of computers in the 1970s, lookaheads were limited to the next
token. A PEG parser relaxes this restriction by allowing unlimited lookahead.
Amethyst also supports unlimited lookaheads but with a slightly different syntax:

Name       Description
e1 & e2    Positive lookahead
~e         Negative lookahead

Positive lookahead is similar to intersection. If the input can be matched by e1 then
the lookahead matches the input by e2, otherwise it fails. Negative lookahead succeeds
if and only if e fails, and it consumes no input.

In amethyst integers can be recognized by a rule int. Based on the first character
one can decide if an integer is positive or negative. We can use positive lookaheads
to match positive integers and negative lookaheads to match negative integers in
the following way:

amethyst Numbers {
  negative_int = ~<0-9> int
  positive_int = <1-9> & int
}
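The lookahead semantics described above can be sketched in plain Ruby without amethyst. This is only an illustration under our own assumptions: a parser is modelled as a lambda taking (input, position) and returning [result, new_position] or nil, and all helper names (char_if, seq2, plus, pos_look, neg_look) are hypothetical and not part of amethyst.

```ruby
# A parser is a lambda: (input, pos) -> [result, new_pos], or nil on failure.

# Match one character accepted by the given predicate block.
def char_if(&pred)
  ->(s, i) { i < s.size && pred.call(s[i]) ? [s[i], i + 1] : nil }
end

# Sequencing: run a, then b from where a stopped; b's result is returned.
def seq2(a, b)
  ->(s, i) { (r = a.call(s, i)) && b.call(s, r[1]) }
end

# One or more repetitions, collecting the results into an array.
def plus(e)
  lambda do |s, i|
    out = []
    while (r = e.call(s, i))
      out << r[0]
      i = r[1]
    end
    out.empty? ? nil : [out, i]
  end
end

# Positive lookahead e1 & e2: if e1 matches here, match e2 from the SAME position.
def pos_look(e1, e2)
  ->(s, i) { e1.call(s, i) ? e2.call(s, i) : nil }
end

# Negative lookahead ~e: succeeds without consuming input iff e fails.
def neg_look(e)
  ->(s, i) { e.call(s, i) ? nil : [nil, i] }
end

digit = char_if { |c| c =~ /[0-9]/ }
minus = char_if { |c| c == "-" }
# int = '-'? digit+   (the optional sign is tried first)
int = ->(s, i) { r = minus.call(s, i); plus(digit).call(s, r ? r[1] : i) }

negative_int = seq2(neg_look(digit), int)
positive_int = pos_look(char_if { |c| c =~ /[1-9]/ }, int)

p negative_int.call("-12", 0)
p positive_int.call("12", 0)
p positive_int.call("-12", 0)
```

Note that both lookaheads restart the following expression at the original position, which is what distinguishes them from sequencing.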
Atomic expressions

We represent the following atomic expressions as the calling of a rule:

Atomic expression   Rule call       Description
.                   anything        Match single object (character)
'str'               seq("str")      Match string str
"str"               token("str")    Match string preceded by whitespaces
<str>               regch("str")    Like [str] in regular expressions


Rules anything, seq and regch are implemented as core functionality. Rule
token is derived, and the relevant part of the standard prologue follows:

_ = < \t\r\n>
token(x) = _* seq(x)

We recommend reading Appendix [B], which contains the standard prologue. It is expected
that you will not completely understand some parts now (see footnote). Make a guess what
the unknown parts do. This is the best way to learn a new language and
amethyst is no exception.

Enter operator 



The operators covered so far deal mainly with matching strings. We need an
additional operator, Enter, to deal with general (possibly cyclic) data structures.
The Enter operator is a powerful tool essential to Sections 1.10 and 1.11.

Name        Description
e1[ e2 ]    Enter operator.

An Enter operator matches e1. Then it recursively invokes the parser to match e2
on the result of e1.

The Enter operator is one of the most important generalizations of amethyst. It
allows us to do pattern matching of objects with a high level of abstraction, which
is the topic of Section 1.10 and Section 1.11.



Footnote: Explaining them is the topic of this chapter.

1.6 External interface



So far we can only decide if an expression matches an input or not. To get useful
work done we need to integrate amethyst with a programming language. Amethyst
is designed to be language independent, and the particular language that is used
is called the host language. In this thesis we use Ruby as the host language.

In amethyst each expression yields a result. Results can be bound to rule-local
variables using variable binding and processed by host language expressions, which
we call semantic actions. In shortcuts, a denotes an anonymous variable which
does not occur elsewhere.

Functional languages use the notion of referential transparency [38]. Amethyst
requires a weaker condition: execution is done in a persistent way. When an
alternative fails we revert all changes it made and pretend they never happened.
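One way to picture persistent execution is a choice that gives every alternative a private snapshot of the state. The sketch below is plain Ruby and only illustrates the idea; the function choice and the use of Marshal for copying are assumptions of this example, and the text does not say how amethyst implements the reverting.

```ruby
# Try alternatives in order; each one works on its own deep copy of the state.
# The caller only ever sees the state of the alternative that succeeded,
# so a failed alternative leaves no trace.
def choice(state, *alternatives)
  alternatives.each do |alt|
    trial = Marshal.load(Marshal.dump(state)) # deep copy used as a snapshot
    result = alt.call(trial)
    return [result, trial] unless result.nil?
  end
  nil
end

failing = ->(st) { st[:seen] << "foo"; nil }      # mutates the state, then fails
passing = ->(st) { st[:seen] << "bar"; :matched } # mutates the state, succeeds

result, state = choice({ seen: [] }, failing, passing)
puts state[:seen].inspect
```

The change made by the failing alternative ("foo") is not visible in the final state, only the change made by the successful one.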



Semantic actions and variable binding 



The syntax of semantic actions and variable binding is the following:

Pattern         Expansion   Description
{c}             core        Semantic action.
-> c newline                Alternative syntax.
e:v             core        Variable binding.



We use Ruby closure support to capture scope, as this example shows:

(| int:x "+" int:y |).match("2+2")
puts x+y #-> 4



Syntax sugar for variable binding 

It is common to collect results in an array or to do simple conversions. Conversions
can be expressed by the following syntactic sugar:

Pattern    Expansion    Description
e:{c}      e:it {c}     Do conversion using variable it.



For example, imagine that a third party has written a float rule. Their API 
however returns the result as a string. If we want to return a number instead we 
can write: 



float2 = float:{it.to_f} 

When collecting results into an array, the parameter can be an arbitrary host language
expression, not just a variable:

Pattern    Expansion           Description
e:[c]      e:{c=[*c, it]}      Append result to array c.
e:[*c]     e:{c=[*c,*it]}      Concatenate result to array c.
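The difference between the two expansions is the difference between appending and splicing with the Ruby splat operator. A short plain-Ruby sketch follows; the variable res is a stand-in of ours for the result that amethyst binds to it.

```ruby
c   = [1, 2]
res = [3, 4]   # stands for the result of e (bound to `it` in amethyst)

appended     = [*c, res]    # e:[c]  keeps the result as a single element
concatenated = [*c, *res]   # e:[*c] splices the elements of the result

puts appended.inspect       # [1, 2, [3, 4]]
puts concatenated.inspect   # [1, 2, 3, 4]
```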
Semantic predicates

It is possible to test arbitrary properties by semantic predicates, which are
implemented as:

Expression   Expansion   Description
&{c}         core        Semantic predicate.
~{c}         &{!c}       Negative semantic predicate.



A semantic predicate expression accepts only if the predicate evaluates to true;
otherwise it rejects. In all other respects it behaves exactly as a semantic action.

For example, even integers are matched as follows:
even = int:x &{ x%2 == 0 } -> x

Sometimes we need to represent an expression that always fails. We define the 
following rule in standard prologue: 

fails = &{ false } 

Results of operators 

The result of an atomic expression is typically the string matched by that
expression.

Results of operators can be described by the following identities:

Expression    Expansion
(e1 e2):v     e1 e2:v
(e1|e2):v     e1:v | e2:v
(e1&e2):v     e1 & e2:v
(~e):v        ~e
(e):v         (e:v)
e*:v          {[]}:a ( e:[a] )* {a}:v
e+:v          {[]}:a ( e:[a] )+ {a}:v



Note that lookaheads are always reverted. The main reason is maintainability as 
lookaheads often cause a rule to be called more times than expected. 



Results of rules 



Results of rules are passed by an instance variable of the parser named @@_result.

Expression    Expansion
name = exp    name = exp:@@_result
rule:v        rule {@@_result}:v

1.7 Parametrization



As Ruby functions can take parameters, so can amethyst rules. The syntax for
the simple form of parametrization is the following:

Pattern                  Description
name(p1,p2,...) = exp    Define rule with parameters p1,p2,...
name(p1,p2,...)          Call rule with parameters p1,p2,...

This construction is space sensitive. Placing a space after name is always
interpreted as sequencing of a rule and an expression.

Parametrization in its full generality is more complicated, as will be explained in
Section 1.13. Here we give several examples of using parametrization.

The simplest example of parametrization is the following:

amethyst Adder {
  add(x,y) = -> x+y
  four = add(2,2)
}

A parametrized rule can be called from Ruby with an input string followed by the rule
parameters:

Adder.add("",2,2) #-> 4

Several builtin parametrized rules are included in amethyst. We have already
seen the parametrized rules seq and token.

Replacing text is a common task. Say we want to replace "foo" with "FOO" and
"bar" with "BAR". We can use the builtin rule replace:

Amethyst.replace("fooobars", (| ("foo" | "bar"):{it.upcase} |))
#-> "FOOoBARs"

Note that amethyst expressions can also be passed inside grammars.
The example above can also be written as:

amethyst Param {
  replace_foobar = replace( (| ("foo" | "bar"):{it.upcase} |) )
}

The construction above is called a lambda [9]. Rule calls inside a lambda are resolved
lexically [2]. We form a closure over the enclosing amethyst rule, as is expected from
a lambda.



Only two parametrized rules are core atomic expressions:

Rule        Description
apply(x)    Apply lambda in parameter
seq(x)      Match string or apply lambda in parameter

The reason why apply does not accept a string as a parameter is that we want to do
resolving in the caller's scope.

Other parametrized rules just use seq and apply. In the standard prologue we follow
the good practice that a rule that accepts a string as a parameter accepts a lambda too.
As an example consider how the rules find and replace are implemented in the
standard prologue:

find(exp) = ( seq(exp):x break | . )* .* -> x
replace(exp) = ( apply(exp) | . )*:{it.join}



Closer look at lambda 

Amethyst lambdas form a closure as is illustrated in the following example: 

amethyst Closure {
  lambda(z) = { (| {puts z+=1} |) }
  test = lambda(3):x apply(x) apply(x) apply(x)
}

Closure.test("") #-> 4 5 6
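The same behavior can be reproduced with an ordinary Ruby closure. The following sketch is a plain-Ruby analogue under our own naming (make_lambda is hypothetical); it shows that the captured parameter z survives between calls.

```ruby
# Plain-Ruby analogue of the rule lambda(z): the returned lambda
# captures z and increments it on every call.
def make_lambda(z)
  -> { z += 1 }
end

x = make_lambda(3)
values = [x.call, x.call, x.call]
puts values.join(" ")   # 4 5 6
```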

Lambda can receive arguments. We can read arguments by calling method. 

This allows implicit syntax for partial application. 

par(x,y,z) = -> puts(x +y*z ) 
foo(x) = -> (I pari ,x, ) |) 



Example: Parser combinators 

Parser combinators [7] are a popular way to implement parsers among people with
a functional programming background. They allow the construction of parser
expressions by using the host language operators. Combinator support is easy
to add by defining operators for the amethyst lambda. An implementation follows:

amethyst Combinators {
  plus(x,y) = -> (| seq(x) seq(y) |)
  or(x,y)   = -> (| seq(x) | seq(y) |)
  and(x,y)  = -> (| seq(x) & seq(y) |)
  not(x)    = -> (| ~seq(x) |)
  star(x)   = -> (| seq(x)* |)
}

class AmethystLambda
  bin_op=[['+',:plus],['|',:or],['&',:and]]
  un_op =[['star',:star],['~',:not]]
  bin_op.each{|sym,name|
    define_method(sym){|x| Combinators.send(name,nil,self,x)}
  }
  un_op.each{|sym,name|
    define_method(sym){ Combinators.send(name,nil,self)}
  }
end

A "hello world" example when we use parser combinators becomes: 

(| 'hello' |) + (| ' ' |) + (| 'world' |)

Moreover, a user can write:

(| '|' |) | (| '|' |)

instead of

'|' | '|'
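The operator-overloading idea behind the combinators can be tried without amethyst. The following is a self-contained sketch under our own assumptions: the class P and the helper lit are hypothetical names, and a parser is modelled as a block taking (input, position) and returning [result, new_position] or nil.

```ruby
# A tiny parser object wrapping a block: (input, pos) -> [result, new_pos] or nil.
class P
  def initialize(&fn)
    @fn = fn
  end

  def call(s, i = 0)
    @fn.call(s, i)
  end

  # Sequencing: self, then other; the result of the last parser is returned.
  def +(other)
    P.new { |s, i| (r = call(s, i)) && other.call(s, r[1]) }
  end

  # Prioritized choice: other is tried only if self fails.
  def |(other)
    P.new { |s, i| call(s, i) || other.call(s, i) }
  end

  # Negative lookahead: succeeds, consuming nothing, iff self fails.
  def ~
    P.new { |s, i| call(s, i) ? nil : [nil, i] }
  end
end

# Parser for a literal string (helper name is an assumption of this sketch).
def lit(str)
  P.new { |s, i| s[i, str.size] == str ? [str, i + str.size] : nil }
end

hello_world = lit("hello") + lit(" ") + lit("world")
p hello_world.call("hello world")
p (lit("a") | lit("b")).call("b")
```

As in the amethyst version, the host language operators +, | and ~ build new parser objects from existing ones.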
1.8 Amethyst extends PEG



Parsing expression grammars (PEG) were introduced by Ford [12]. Our parser 
started as a PEG parser but evolved into a more general language. In this section 
we explain the similarities and differences between PEG and our parser. 



PEG operators 



A typical PEG parser defines expressions formed by the following operators:

Operation    Description
e1 e2        Sequencing
e1|e2        Ordered choice
e*,e+,e?     Iteration
&e,~e        Lookaheads



Sequencing and choice 

Parsing expression grammars achieve linear time by making choice deterministic
and by memoization. In PEG, ordered choice tries alternatives in left-to-right
order, and when an alternative succeeds it does not try further alternatives.
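The combination of ordered choice and memoization can be illustrated by a small packrat-style recognizer in plain Ruby. This is a sketch under our own assumptions, not the amethyst implementation: the class TinyPeg, its toy grammar and the evaluation counter are hypothetical.

```ruby
# Packrat-style sketch: each (rule, position) pair is evaluated at most once.
class TinyPeg
  attr_reader :evaluations

  def initialize(input)
    @input = input
    @memo = {}
    @evaluations = 0
  end

  # Memoizing wrapper around the rule bodies below.
  def apply(rule, pos)
    key = [rule, pos]
    return @memo[key] if @memo.key?(key)
    @evaluations += 1
    @memo[key] = send(rule, pos)
  end

  # expr = term "+" expr / term     (ordered choice: the first success wins)
  def expr(pos)
    if (p1 = apply(:term, pos)) && @input[p1] == "+" && (p2 = apply(:expr, p1 + 1))
      return p2
    end
    apply(:term, pos) # the second alternative reuses the memoized term
  end

  # term = [0-9]+   (returns the position after the match, or nil)
  def term(pos)
    stop = pos
    stop += 1 while stop < @input.size && @input[stop] =~ /[0-9]/
    stop > pos ? stop : nil
  end
end

parser = TinyPeg.new("1+22+333")
end_pos = parser.apply(:expr, 0)
puts end_pos
puts parser.evaluations
```

Although term is requested twice at the last summand (once by each alternative of expr), its body runs only once per position, which is the source of the linear time bound.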

Amethyst extends this choice to prioritized choice that does backtracking. Linear
time is obtained by adding several natural conditions to recursion, as is described
in Chapter [2].

To describe the semantics of our parser we have chosen to define auxiliary constructs Cut
and Stop that simplify the description (see footnote). A compiler may use a different representation,
for example the one defined in [2].

Our choice operator tries alternatives in left-to-right order. When an alternative
succeeds it does not try further alternatives. We extend choice with a Cut operator
that, when encountered, prevents the parser from trying other choices. This allows
a more tractable description of the lookaheads.



Then behavior of operators from the Section [T3] can be described by the following 
table: 



Operation    Expansion              Description
e?           e | {nil}              Make e optional.
Cut          auxiliary              Like ! in Prolog.
~e           e Cut fails | {nil}    Negative lookahead.
e1 & e2      ~~e1 e2                Positive lookahead.

Footnote: A similar situation is extending reals to complex numbers.

Iteration



To describe iteration terminating conditions we define an additional atomic
expression:

Expression   Expansion   Description
break        core        To terminate iteration.

We explain iteration by an auxiliary repeat-until operator. Repeat-until repeatedly
tries to match its body and ends only after Stop is encountered. If an iteration
fails then repeat-until fails.

Expression   Expansion       Description
e**          auxiliary       repeat-until
Stop         auxiliary       Stop iteration
e*           e**             When e contains Stop,
e*           (e | Stop)**    otherwise.
break        Cut Stop        Possible expansion.



Examples 

Operators Cut and Stop were introduced to describe the semantics of break. A
common task is to collect characters until a certain character occurs. The until
rule defined in the standard prologue has the following implementation:

until(chr) = ( seq(chr) break
             | '\\':[x] .:[x]
             | .:[x]
             )* -> x.join

For example, the rule line can be implemented as:

newline = '\r\n' | '\r' | '\n'
line = until(| newline |)
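For readers who prefer an imperative view, the behavior of until can be approximated by the following plain-Ruby loop. It is a sketch under our own assumptions: the function collect_until is hypothetical, it only accepts a literal terminator string, and it returns nil when the terminator never occurs, mirroring the rule that repeat-until fails when an iteration fails.

```ruby
# Collect characters until `stop` is seen. A backslash escapes the next
# character (both characters are kept, as in the rule above).
# Returns [collected_text, rest_of_input], or nil if `stop` never occurs.
def collect_until(input, stop)
  out = []
  i = 0
  while i < input.size
    if input[i, stop.size] == stop
      # corresponds to "seq(chr) break": the terminator is consumed
      return [out.join, input[(i + stop.size)..-1]]
    elsif input[i] == "\\" && i + 1 < input.size
      out << input[i, 2]   # escaped pair, e.g. \;
      i += 2
    else
      out << input[i]
      i += 1
    end
  end
  nil
end

text, rest = collect_until("a\\;b;rest", ";")
puts text
puts rest
```

The escaped terminator inside the text does not end the iteration; only the unescaped one does.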

Some functionality of the C standard library can be translated into amethyst as:

C variant          Amethyst variant
scanf("%i")        int
scanf("%f")        float
scanf("%[xyz]")    until(| <xyz> |)
scanf("%s")        until(| _ |)
gets               line = until(| newline |)

1.9 Inheritance



In object oriented languages inheritance is a form of reusing code by subtyping
existing objects. In Ruby class names must be capitalized. Ruby has single
inheritance with mixins, as is shown in the following example:

class Foo
  def foo; 42 ;end
end
puts Foo.new.foo #-> 42

class Bar < Foo
  def foo; super+1 ;end
end
puts Bar.new.foo #-> 43

module Baz
  def foo; super*2 ;end
end

class Bar < Foo # a class can be defined piecewise
  include Baz   # include module
end
puts Bar.new.foo #-> 85

# Ruby implements a mixin by inserting the module between
# the current class and its parent class
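The insertion of the module between the class and its parent can be observed with Module#ancestors. The sketch below restates the classes from the example in one piece so that it runs on its own.

```ruby
class Foo
  def foo; 42; end
end

module Baz
  def foo; super * 2; end
end

class Bar < Foo
  include Baz
  def foo; super + 1; end
end

# The method lookup order: Bar first, then the mixin Baz, then the parent Foo.
chain = Bar.ancestors.take(3)
puts chain.inspect     # [Bar, Baz, Foo]
puts Bar.new.foo       # 85, computed as (42 * 2) + 1
```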



Inheritance in amethyst 

We reuse the Ruby class system for inheritance. Ruby class names must be
capitalized.

amethyst Foo {
  foo = {42}
}

amethyst_module Mod {
  foo = super:x { 2*x }
}

amethyst_module Baz {
  baz = {"baz"}
}

class Bar < Foo
  include Mod
end

amethyst Bar < Foo {
  foo = super:x {x+1}
  foo_orig = Foo::foo
  baz = Baz::baz
}

One can use the Grammar::rule syntax to call a rule from an ancestor or a rule from a module.
Calling a rule of an arbitrary grammar is not allowed.

1.10 Pattern matching of tree-like structures



Amethyst takes inspiration from OMeta (2007) [42], which extended parsing
expression grammars (2002). One of the extensions made in OMeta is pattern
matching of tree-like data structures. We further extend this work in several respects.
One, described in the next section, is extending pattern matching to arbitrary data
structures with possible cyclic references.

All operators defined so far carry over into this setting. The Enter operator and
parametrized rules are essential for this transition.

1.10.1 Pattern matching in functional languages 



Most functional languages offer a limited form of pattern matching. While the syntax
differs, it usually boils down to the following constructs:

Expression      Description
Struct          Match when it is the described structure with the given name
:x              Bind head to variable x
exp1 exp2       Sequencing
exp1 | exp2     Choice
[ exp ]         Enter: take head and match it recursively with exp



Our framework extends these operations with iteration constructs and other fea- 
tures. 



1.10.2 Matching nested arrays in amethyst 



Recall the following amethyst operators. For matching arrays Enter can omit the leading
"." as syntactic sugar.

Expression   Expansion   Description
.            core        Match single element
e1[ e2 ]     core        Enter operator.
[ e ]        .[ e ]      For nested arrays.

Then matching of arrays is quite natural:

(| .:x [ .:y .:z ] |).match([1,[2,3]])
puts x,y,z #-> 1 2 3



1.10.3 Classes and pattern matching in Ruby 

A common convention for constructing tree-like structures in Ruby is to define a []
class method. One way to construct syntax trees in Ruby source code is the
following:

Plus[1, Times[2, 3, Plus[4, Plus[1, 5]]], 3]
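A minimal sketch of this convention follows. The base class Node and its children reader are hypothetical names introduced for the illustration; only the [] class method is the convention described in the text.

```ruby
# Generic tree node; subclasses only provide a name.
class Node
  attr_reader :children

  # Class-level [] so that trees can be written as Plus[1, Times[2, 3]].
  def self.[](*children)
    new(children)
  end

  def initialize(children)
    @children = children
  end
end

class Plus  < Node; end
class Times < Node; end

tree = Plus[1, Times[2, 3, Plus[4, Plus[1, 5]]], 3]
puts tree.class
puts tree.children[1].class
```

Because self.[] calls new on the receiving class, Plus[...] builds a Plus node and Times[...] builds a Times node without any further code in the subclasses.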
In Ruby membership is tested by the case construct. We demonstrate it on a contrived
implementation of logarithm:

def log10(x)
  case x
  when 0      then raise "not defined"
  when 1..9   then 1
  when 10..99 then 2
  when Float  then log10(x.to_i)
  else             log10(x/10)+1
  end
end

A case match is done by invoking the === method. Use cases of this method are
diverse, as demonstrated by the following examples:

Left argument       Test performed
true, false, nil    Equality.
42, 3.14            Equality.
-42..42             Range membership.
Class               Class membership.
/exp/               Regular expression match.
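The tests in the table can be tried directly, since === is an ordinary method that each class defines for itself. The sketch below is plain Ruby; the hash checks and its labels are only for display.

```ruby
checks = {
  "nil === nil"           => (nil === nil),
  "42 === 42"             => (42 === 42),
  "(-42..42) === 7"       => ((-42..42) === 7),
  "Float === 3.14"        => (Float === 3.14),
  "/exp/ === 'an expert'" => (/exp/ === "an expert"),
  # === is not symmetric: the pattern must be the left argument.
  "3.14 === Float"        => (3.14 === Float)
}

checks.each { |label, value| puts "#{label} #-> #{value}" }
```

The last line shows why the table speaks of the left argument: a Float value compared against the class Float is a plain equality test and yields false.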



1.10.4 Class membership 

The implementation of matching of basic types depends on the host language. We
inform amethyst about this by defining a parametrized rule member. To implement
the member rule in Ruby we use the === operator from the previous section:

member(x) = .:a &{x === a} {a}

We define tests for the following basic types:

Expression          Expansion
true, false, nil    member(true),...
42                  member(42)
-42..42             member(-42..42)
Class               member(Class)

1.10.5 Building abstract syntax trees



When we parse we typically build some abstract syntax tree. The following
atomic expression makes creation of an AST more convenient.

Expression   Expansion                         Description
@Class       {Class.create(local_variables)}   Create object.



This syntax also encourages proper naming of variables. Assume we want to
change the calculator from the first section to produce a syntax tree. A possible
implementation is:

class Add
  attr_accessor :x, :y # accessors needed by create below

  def self.create(hash)
    a=Add.new
    a.x=hash[:x]
    a.y=hash[:y]
    return a
  end

  def amethyst_array
    [@x,@y]
  end
end
# ...

amethyst Calculator_AST {
  calculate_ast = add_expr

  add_expr = add_expr:x "+" mul_expr:y @Add
           | add_expr:x "-" mul_expr:y @Substract
           | mul_expr

  mul_expr = mul_expr:x "*" atom_expr:y @Multiply
           | mul_expr:x "/" atom_expr:y @Divide
           | atom_expr

  atom_expr = "(" add_expr:x ")" -> x
            | float
}

1.10.6 Pattern matching of tree like structures

Amethyst can match any object as an array. First (see footnote) amethyst tries to call the method
amethyst_array and to do matching on the returned array. If the amethyst_array
method is not defined it pretends that an empty array was returned. This is useful
when we match arbitrary objects.

The Enter operator combined with property testing allows us to write an evaluator for the
calculator in the following way:

amethyst Evaluator < Calculator_AST {
  eval = Add[ eval:x eval:y ] -> x+y
       | Times[ eval:x eval:y ] -> x*y
       | Multiply[ eval:x eval:y ] -> x*y
       | Numeric

  calculate = calculate_ast=>eval
}

Evaluator.calculate("2+2") #-> 4

The example above was created to show the possibility of the following simplification:

amethyst Evaluator < Calculator_AST {
  eval = Add[ eval:x eval:y ] -> x+y
       | (Times | Multiply)[ eval:x eval:y ] -> x*y
       | Numeric

  calculate = calculate_ast=>eval
}

Evaluator.calculate("2+2") #-> 4

If we also wanted to represent addition as an n-ary operation we could extend the
previous example in the following way:

amethyst Evaluator < Calculator_AST {
  eval = plus[ eval:x eval:y ] -> x+y
       | (Times | Multiply)[ eval:x eval:y ] -> x*y
       | Numeric

  plus = Add
       | Plus[ .:first .+:rest ] -> Add[first, Plus[*rest]]
       | Plus[ .:first ] -> first

  calculate = calculate_ast=>eval
}

# 2 + 3 + 2*2 + 1 + 2*2
Evaluator.eval(Plus[2, 3, Multiply[2, 2], 1, Times[2, 2]]) #-> 14

Evaluator.calculate("2+2") #-> 4

The rule plus shows a way to achieve independence of representation. It allows
us to freely switch between a representation of addition as an array of summands
and a recursive representation.
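The idea of normalizing one representation into another before evaluating can also be sketched in plain Ruby, without amethyst. Everything in the following block is an assumption made for the illustration: the node classes, the children reader and the functions normalize and evaluate are hypothetical and only mimic the roles of the rules plus and eval.

```ruby
# Hypothetical node classes; each node just stores its children.
class Node
  attr_reader :children
  def self.[](*children); new(children); end
  def initialize(children); @children = children; end
end
class Add      < Node; end
class Plus     < Node; end
class Times    < Node; end
class Multiply < Node; end

# Rewrite an n-ary Plus into nested binary Add nodes (the role of rule plus).
def normalize(node)
  first, *rest = node.children
  rest.empty? ? first : Add[first, Plus[*rest]]
end

def evaluate(node)
  node = normalize(node) while node.is_a?(Plus)
  case node
  when Add             then evaluate(node.children[0]) + evaluate(node.children[1])
  when Times, Multiply then evaluate(node.children[0]) * evaluate(node.children[1])
  when Numeric         then node
  else raise ArgumentError, "unknown node: #{node.inspect}"
  end
end

# 2 + 3 + 2*2 + 1 + 2*2
puts evaluate(Plus[2, 3, Multiply[2, 2], 1, Times[2, 2]])   # 14
```

The evaluator itself only knows binary Add; the n-ary Plus is handled entirely by the normalization step, so either representation can be used.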



Footnote: Unless the object is a String or Array.

1.11 Matching arbitrary objects



Most of the rules are written to match element of an array. The Pass operator 
that behaves like the Enter operator except it wraps the first result into one 
element array. 



Name      Expansion          Description
e1=>e2    e1:a {[a]}[e2]     Pass operator.



As Ruby is an object-oriented language, you can discover the state of an object only
through method calls. Inside the Enter operator you can call methods of the matched
object. The syntax is the following:



Atomic expression    Description
@method              Call method of matched object.
@method(a1,a2)       Call method with arguments.



You can call methods of the matched object inside semantic actions with the same syntax.

Note that the disambiguation between a method call and object creation is based on the
fact that in Ruby all class names are capitalized.

Example:

We can also write the evaluator by accessing object methods:

amethyst Evaluator {
  eval = (Add | Times | Multiply)[ @x=>eval:x @y=>eval:y
                                   -> @is_a?(Add) ? x+y : x*y
                                 ]
       | Numeric
}

Evaluator.eval(Calculator_AST.calculate_ast("2+2*2")) #-> 6

We can match arbitrary objects, for example hashes.

amethyst Match_Hash {
  match = @fetch(:b)=>[ .:x .:y ]
          -> x*y+@fetch(:a)+@fetch(:c)
}

h = {:a=>1, :b=>[2,2], :c=>4}
Match_Hash.match(h) #-> 9
1.12 Dataflow analysis generalizes tree traversal

Dataflow analysis [20] is an important technique in compiler optimization. It
generalizes tree traversal to handle cyclic dependencies.

We illustrate dataflow analysis on a real-world example that amethyst needs to
solve. We start with two simpler problems where a simpler approach is sufficient,
until we get into a situation where dataflow analysis is necessary.



First example: In regular expressions

We are given a regular expression and want to know the minimal length of a string that
matches it. For simplicity, the expression is given as a syntax tree
consisting only of immutable Or, Seq and Char nodes for binary choice, sequencing
and matching a single character.

amethyst Regexp_minimal_size {
  value = Seq[ value:v1 value:v2 ] -> v1+v2
        | Or[ value:v1 value:v2 ] -> min(v1,v2)
        | Char -> 1
}

puts Regexp_minimal_size.value( # abc|de
  Or[Seq[Char['a'],Char['b'],Char['c']],
     Seq[Char['d'],Char['e']]])
#-> 2



Second example: Adding rules

Now we add rule calls, represented by an immutable node Rule containing a link to
the body to execute. An example follows:

# foo = 'foo'
# bar = 'bar'
# foobar = foo bar
foo = Seq[Char['f'],Char['o'],Char['o']]
bar = Seq[Char['b'],Char['a'],Char['r']]
foobar = Seq[Rule[foo],Rule[bar]]

As long as no recursion is present, we can modify our traverser into:

amethyst Rules_minimal_size {
  value = Rule[ value:v ] -> v
        | Seq[ value:v1 value:v2 ] -> v1+v2
        | Or[ value:v1 value:v2 ] -> min(v1,v2)
        | Char -> 1
}

Rules_minimal_size.value(foobar) #-> 6

It will always terminate and produce the correct result.
Dealing with recursion

When recursion is present, dataflow analysis becomes necessary.

Dataflow analysis is a method of solving systems of monotonic equations over an
arbitrary lattice. In our case we use the lattice given by the ordering of the natural
numbers. We interpret the value rule of the previous example as an inequality that
bounds the size of an expression in terms of the sizes of its subexpressions. We implement
the well-known worklist algorithm [20] and use it to find the minimal solution of the
dataflow equations that correspond to our inequalities.

The algorithm starts by setting every value to zero. This violates some
inequalities. When an inequality is violated, we increase its left side to the value of
its right side. We repeat this until all inequalities are satisfied. When the algorithm
terminates, all inequalities are satisfied and each value attains its
minimum among all solutions. We use the algorithm
by inheriting from the Dataflow grammar, which is implemented in the next section.

We do not have to change our code much to use this analysis:

amethyst Rules_minimal_size < Dataflow {
  flow = Rule[ visit:v ] -> v
       | Seq[ visit:v1 visit:v2 ] -> v1+v2
       | Or[ visit:v1 visit:v2 ] -> min(v1,v2)
       | Char -> 1
}

class Rules_minimal_size < Dataflow
  def lattice_bottom; 0; end # Starting solution.
  def lattice_join(x,y); [x,y].max; end
end

Monotonicity in our case means that if we increase a value on the right side, the
corresponding value on the left side cannot decrease. This holds in our case.

A condition under which this algorithm terminates is finite height of the lattice. This
means that for every value we can bound the number of increases needed to reach
it by the same constant. If no recursive rule without a terminating condition
is present, then we know that the minimal solution satisfies this condition and our
algorithm terminates.
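The iteration described above can be sketched in plain Ruby. This is a simplified illustration, not the implementation used by amethyst: it re-evaluates every rule in each round instead of keeping a worklist, and the Struct classes and the helpers size and minimal_sizes are hypothetical names.

```ruby
# Hypothetical node classes standing in for the immutable AST nodes.
Char = Struct.new(:c)
Seq  = Struct.new(:a, :b)
Or   = Struct.new(:a, :b)
Rule = Struct.new(:name) # refers to a rule body by name

# Right-hand side of the inequality for one node, given current rule values.
def size(node, vals)
  case node
  when Char then 1
  when Seq  then size(node.a, vals) + size(node.b, vals)
  when Or   then [size(node.a, vals), size(node.b, vals)].min
  when Rule then vals[node.name]
  end
end

# Start at the lattice bottom (0) and raise values until nothing changes.
def minimal_sizes(rules)
  vals = Hash.new(0)
  loop do
    changed = false
    rules.each do |name, body|
      v = [vals[name], size(body, vals)].max # lattice join
      if v > vals[name]
        vals[name] = v
        changed = true
      end
    end
    return vals unless changed
  end
end

# list = 'a' list | 'b'   (recursion with a terminating branch)
rules = {
  list: Or.new(Seq.new(Char.new('a'), Rule.new(:list)), Char.new('b'))
}
puts minimal_sizes(rules)[:list] # prints 1
```

For a rule without a terminating branch, such as r = 'a' r, the values would grow without bound and this loop would not stop, which matches the termination condition stated above.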



1.12.1 Implementing dataflow analysis 

This analysis deals only with immutable objects. It is possible to support mutable
objects by adding timestamps that tell whether an object has changed.

In this section we describe a variant of incremental dataflow analysis [5¥]. We
add two new properties.

The first property is that the analysis is dynamic. The user does not have to construct
the dataflow graph. The graph is learned automatically, and dependencies may differ
for different value assignments. This property naturally describes concepts like
short-circuit evaluation or flow-sensitive analysis.

The second is that the analysis is lazy, in the sense that it does not compute values
until they are needed.

A simple implementation of our analysis follows.

The amethyst interface is the following:

amethyst Dataflow {
  visit = .:x {depends(x); @@vals[x]}
  root = .:x analyze(x)
  getvalue(v) = {@@vis=v; v}=>visit
}

A simple analysis based on the worklist algorithm follows.

class Dataflow < Amethyst
  def value(e)
    @active={}
    @activea=[e]
    while el=@activea.pop
      @active.delete(el)
      @depend.delete_all_edges_to(el)

      val=getvalue(el)
      val=lattice_join(val,@vals[el])
      if val > @vals[el]
        @vals[el]=val
        @depend.edges[el].each{|d| addactive(d)}
      end
    end
    @vals[e]
  end

  def depends(e)
    @depend.add(e,@vis) if !@depend.edges[e].include?(@vis)
    if !@visited[e]
      @visited[e]=true
      addactive(e)
    end
  end

  def addactive(e)
    if !@active[e]
      @active[e]=true
      @activea<<e
    end
  end

  def initialize
    @depend=Oriented_Graph.new
    @vals=Hash.new(lattice_bottom)
    @visited={}
  end
end
1.13 Parameters as an object

We have now covered enough background to describe the full form of amethyst
parametrization.

In Ruby we can pass parameters in several ways:

def a(x,y)
  x+y
end
puts a(2,2) #-> 4

def b(x=1,y=2)
  x+y
end
puts b(2) #-> 4

def c(x,y,*ary)
  ary.inspect
end
puts c(1,2,3,4,5) #-> [3, 4, 5]

def with_block
  yield(1) + yield(3)
end
with_block{ |x| x+3 } #-> 10

# Ruby 1.9 emulates named parameters by passing the last parameter
# as a hash and allowing the {} to be omitted.
def named(x)
  puts x.inspect
end
named(:foo=>1, :bar=>2) #-> {:foo=>1, :bar=>2}

Amethyst parametrized calls are done by creating a special object and pattern
matching it against the definition.

We can describe them by the following shortcuts:



Pattern                      Description
name(pattern) = e            _name(args) = {args}=>pattern e
name(a1, a2, key1: val1)     _name(Arguments[[a1,a2], {key1=>val1}])

We use a convention similar to block passing in Ruby.

e(c1,c2)(| e2 |)             e(c1,c2,(| e2 |))



For matching arguments we use a different syntax. The following syntax expresses idioms
common in argument passing more directly and extends the syntax of Ruby argument
passing.

Pattern      Expansion            Description
name         .:name               Positional argument
*name        .*:name              Splat operator
name:e       e:name               Match amethyst expression
@name        @name:name           Named argument
@name:e      @name=>e:name        Match named argument with expression
name=val     (. | {val}):name     Optional argument
@name=val                         Optional named argument
An example with use cases follows:

amethyst Parametrization {
  opt(x,y=1)  = -> x+y
  use_opt     = opt(1,2)        #-> 3
  use_opt2    = opt(1)          #-> 2
  multi(x,*y) = -> y
  use_multi   = multi(1,2,3)    #-> [2,3]

  check(x:String, y:String) = -> x+y
  use_check   = check("a","b")  #-> "ab"
  use_check2  = check(1,2)
              | -> "failed"     #-> "failed"

  named(@x=1, @y=2) = -> x+y
  use_named   = named(x:3,y:3)  #-> 6
  use_named2  = named(x:2)      #-> 4
  use_named3  = named(y:1)      #-> 2
}



Example: Syntax highlighting

The syntax highlighting in this thesis was relatively simple to implement with the
amethyst parser. This example relies on parametrized rules.

Consider the following simplified part of the amethyst grammar:

postfixed = term
    ( '=>' term
    | '[' expression "" ']'
    | <+*?>
    | ':' '[' (key | name) ']'
    | ':' (key | name)
    | inline_host_expr
    )*



This grammar can be annotated with colors in the following way:

postfixed = term
    ( color("blue" )(| '=>'                      |) term:e
    | color("blue" )(| '[' expression:e ']'      |)
    | color("black")(| <+*?>                     |)
    | color("green")(| ':' '[' (key | name) ']'  |)
    | color("green")(| ':' (key | name)          |)
    | inline_host_expr
    )*

# a sample implementation of color can be
color(col,lam) = {pos}:oldpos apply(lam):r
                 {color_by(col,oldpos,pos)}
                 -> r



This approach avoids any change to the actual text representation of the
input (as opposed to translating abstract syntax trees back to text form). The
annotation is straightforward. If amethyst were used as a syntax highlighting engine,
it would be realistic to expect advanced IDE users to write new grammars. This
approach also benefits from dynamic parsing (Chapter [3]).
1.14 Taming state

Purity is an important concept in programming languages. We say that a function
is pure when it cannot produce any side effect. The advantage of pure functions is
that they are easy to reason about. When a function is not pure, its behaviour
depends on operations performed before the function was called, also known as state.
Often we have to add state to a function as a necessary evil. We present several
constructions that make the behavior of state more predictable.

We take inspiration from several earlier attempts. We could view Warth's worlds
[12] as the first attempt. However, worlds are applied only to position tracking,
so all the work is left to the programmer. The Rats! parser [16] recognizes the problem
and proposes transactions. Again, bookkeeping is left to the programmer. In a general
setting, Tanter's contextual values [35] are more general than Warth's worlds [12].

Amethyst uses a similar idea. For modularity reasons we must split contextual values
into two cases: contextual arguments and contextual returns.

Local state 

Local state refers to how the values of local variables inside a function can change.
Functional languages use the notion of referential transparency [08]. We use a weaker
notion: when an alternative fails, we revert all local variables to the state just
before the alternative was tried.

Lookaheads are especially dangerous because they break the assumptions a programmer
makes about state. We therefore always revert to the state before the lookahead.

Reverting local state may come as a surprise in the context of initializing variables:

foo = {x=4} fail | success {puts x}   #-> nil because x=4 was reverted.
foo = {x=4} (fail | success {puts x}) #-> 4

This can be done efficiently by data structures that support backtracking persistence.
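The reverting behaviour can be sketched in plain Ruby. The sketch below is an assumption-laden illustration: it models local variables as a Hash and copies it for every alternative, which is simpler but slower than a backtracking-persistent structure. The helper name choice is hypothetical.

```ruby
# Try alternatives in order. Each one works on a copy of the local variables,
# so a failed alternative leaves `env` exactly as it was.
def choice(env, *alternatives)
  alternatives.each do |alt|
    trial = env.dup
    if alt.call(trial)
      env.replace(trial) # commit the locals of the successful alternative
      return true
    end
  end
  false
end

# foo = {x=4} fail | success {puts x}
env = {}
choice(env,
       ->(e) { e[:x] = 4; false },         # {x=4} fail
       ->(e) { puts e[:x].inspect; true }) # prints nil

# foo = {x=4} (fail | success {puts x})
env = {}
env[:x] = 4
choice(env,
       ->(_e) { false },
       ->(e) { puts e[:x].inspect; true }) # prints 4
```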

Global state

Handling global state is trickier. Memoizing parsers cause objects to be
shared unexpectedly. The following example returns a modified object instead of
the correct unmodified one:

foo:x {x.a=4} bar | foo

Here backtracking persistence cannot help, as the memoized value would be reverted
back to nil.
There are several ways to mitigate this problem.

• Blame the programmer. 

• Recursively clone everything. When naively done we are about as slow as if 
we would recalculate everything. Using full persistence can typically reduce 
overhead to constant factor [JJJ. Disadvantage is that all user structures 
must support persistence. 

• Recursively make every result immutable. This preserves time complexity 
as we make every object immutable at most once. 

We chose the last alternative as it is conceptually the simplest.
Dynamic parsing also benefits from immutability, as we will see in Chapter [3].
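Making every result immutable can be sketched in Ruby with Object#freeze. The helper deep_freeze and the class ParseResult below are hypothetical names chosen for this illustration. The sketch assumes that an object that is already frozen has already been processed, which is what bounds the work to one visit per object.

```ruby
# Recursively freeze a result and everything reachable from it.
def deep_freeze(obj)
  return obj if obj.frozen? # assumed to be processed already
  obj.freeze                # freeze first, so cyclic structures terminate
  case obj
  when Array then obj.each { |e| deep_freeze(e) }
  when Hash  then obj.each { |k, v| deep_freeze(k); deep_freeze(v) }
  else
    obj.instance_variables.each do |iv|
      deep_freeze(obj.instance_variable_get(iv))
    end
  end
  obj
end

# Hypothetical result object with a mutable attribute, as in x.a=4 above.
class ParseResult
  attr_accessor :a
  def initialize(a)
    @a = a
  end
end

memo = { foo: deep_freeze(ParseResult.new([1, 2])) } # memoized result

begin
  memo[:foo].a = 4 # the mutation from the example above
rescue FrozenError => e
  puts "rejected: #{e.class}"
end
```

With frozen results, the problematic semantic action fails loudly instead of silently corrupting the memoized value.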



1.14.1 Contextual arguments and return 

Contextual arguments and returns are a more transparent way to model global state
than global variables.

To supply context we use a contextual argument @>name. A contextual
argument is accessible to all rules that the current rule calls. However, a change of
a contextual argument in a son does not change the parent's contextual argument. We
illustrate this with an example:

foo1 = {@>name="foo"} foo2 {puts @>name} #-> foo
foo2 = foo3
foo3 = {@>name=@>name+"bar"} foo4 {puts @>name} #-> foobar
foo4 = {puts @>name} #-> foobar
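One way to picture contextual arguments in plain Ruby is an explicit context that callees receive and callers never see modified. The module ContextualArgument and the explicit ctx parameter are assumptions of this sketch; amethyst passes the context implicitly.

```ruby
module ContextualArgument
  module_function

  def foo4(ctx)
    [ctx[:name]]
  end

  def foo3(ctx)
    # merge builds a new Hash, so the caller's context stays untouched
    ctx = ctx.merge(name: ctx[:name] + "bar")
    foo4(ctx) + [ctx[:name]]
  end

  def foo2(ctx)
    foo3(ctx)
  end

  def foo1
    ctx = { name: "foo" }
    seen = foo2(ctx)
    seen + [ctx[:name]] # the parent still sees "foo"
  end
end

p ContextualArgument.foo1 # prints ["foobar", "foobar", "foo"]
```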

The second most frequent use of global state is to collect values that are
inconvenient to collect directly.

A contextual return @<name is the concept dual to contextual arguments. It can be
viewed as a set such that every parent gets the union of the contextual returns of its
sons. This also elegantly handles the case when a contextual return does not return
anything. Again we illustrate contextual return with an example:

foo1 = foo2 {puts @<names} #-> ["foo","bar","baz"]
foo2 = foo3
foo3 = foo4 suppress bar
foo4 = { @<names << "foo" }
suppress = sup {@<names=[]}
sup = { @<names << "suppressed" }
bar = { @<names << "bar" } baz
baz = { @<names << "baz" }
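The example above can be mimicked in plain Ruby by letting every rule return the list it collected and letting the parent concatenate the lists of its sons. The module name ContextualReturn is hypothetical, and returning the collection explicitly is an assumption of this sketch.

```ruby
module ContextualReturn
  module_function

  def foo4
    ["foo"]
  end

  def sup
    ["suppressed"]
  end

  def suppress
    sup # what sup collected is discarded ...
    []  # ... because suppress resets the contextual return
  end

  def baz
    ["baz"]
  end

  def bar
    ["bar"] + baz
  end

  def foo3
    foo4 + suppress + bar
  end

  def foo2
    foo3
  end
end

p ContextualReturn.foo2 # prints ["foo", "bar", "baz"]
```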



By defining contextual arguments and returns in this way, memoization that respects
global state becomes tractable.

We defer describing how to implement these concepts into Section 11.141 
1.15 Maintenance

One of the design goals of amethyst is to allow users to write general purpose
grammars that can be extended as the described language or protocol evolves.

To get a specification of a language or protocol, write:

Amethyst::pull 'grammar:version'

This loads the given version of the grammar, downloading it from the central repository
if necessary. A grammar obtained in this way is immutable and will always be the same
on all machines. We expect grammars in the repository to be stable and not to
change often.

However, we expect that protocols will evolve. We want to make updating easier
through migrations. A proposed command is:

amethyst_migrate file grammar:newversion

This will replay the refactorings (for example renaming a rule) described in migration
files to reach the new version. This may not always be possible*; in that case we ask
the programmer to do the migration manually.

Migration from regular expressions 

We also want to make the transition from other frameworks easier. As a simple
example we implemented a functor that converts a subset of regular expressions into
amethyst expressions. Usage is the following:

regexp = /[Hh]ello (world|worlds)/

reg2ame(regexp).inspect #-> (| <Hh> 'ello ' ('world'|'worlds') |)
reg2ame(regexp) === "hello world" #-> true

1.16 Error handling 

Error detection is an important topic in its own right. We implemented only a
simple strategy that detects misplaced parentheses and suggests probable causes.
This is a type of problem that needs global error recovery. It can be
formulated as follows: given a sequence of parentheses, what is the minimal
number of parentheses we have to change to get a properly parenthesized
expression? A simple strategy for guessing the most probable places can be found in the
files amethyst/error_recovery.ame and lib/repair_errors.rb.

Position tracking

Our approach to position tracking is simple. We subclass String to the class
Origin_Tracking_String. Information about position automatically propagates
through the parser and the subsequent pipeline. This also allows us to wrap and
recursively parse substrings while preserving position information.

* For example, code that evaluates strings from standard input.
1.17 Example: Parser of amethyst



We conclude this chapter by explaining amethyst in terms of itself by providing 
amethyst parser in amethyst. Summary of constructions used is in Appendix 1X1 
We omit several parts that are too technical. 

Our first task is to parse rule and variable names.

name = (<_a-zA-Z> <_a-zA-Z0-9>*)[]:{it.join}

className = (<A-Z> <_a-zA-Z0-9>*)[]:{it.join}

File structure 

An amethyst file consists of grammars and host language code. We make "amethyst"
a keyword; otherwise a grammar with an error would be interpreted verbatim.

file = ( grammar
       | lambda
       | ~('amethyst' _) .
       )*



Grammars 

An amethyst grammar consists of rules.

grammar = 'amethyst' name ("<" "" name | {"Amethyst"}):parent
          "{" rule*:rules "}" @Grammar

We specify optional parts of a grammar with the "?" operator. When we also want to
supply a default value we use the idiom ("<" "" name | {"Amethyst"}).



Rules 

argsOpt = args('(',')') | {[]}

# For now you can imagine that args('(',')') matches properly
# nested parentheses.

rule = "" name:name ~_ argsOpt:args "=" expression:body @Rule

call = className:klas '::' name:name ~_ argsOpt:arg # Foreign rule
     | name:name ~_ argsOpt:arg -> Apply[name,arg]

When we want to tell a user reading a grammar that whitespace is forbidden
at a certain place, we use the ~_ idiom even if it is not necessary for the parser.
Sequencing and choice

Sequencing and choice have the usual precedence. At a choice we need to forbid
interpreting the end of a lambda as a choice. Whitespace separates sequence elements.
We use the negative lookahead ~rule_head to separate rules.

expression = listOf((| sequence |), (| "|" ~')' |)):ary @Or_AST
sequence = (~rule_head lookaheads)*:ary @Seq_AST
rule_head = "" name:name ~_ argsOpt:args "="



Lookaheads 

Negative and positive lookaheads are recognized by the following expressions:

lookaheads = "" neg_lahead:[s] ("&" ~"{" "" neg_lahead:[s])*

neg_lahead = '~' ~"{" neg_lahead:m -> Lookahead[m,true]
           | <&~>:n ~_ inline_host_expr:e -> Pred[e,n=='&']
           | postfixed



Postfixes 

Note that postfixes are left-associative. In particular, a=>b? is equivalent to
(a=>b)?.

postfixed = term:from
            ( ( <*+?>
              | '[' expression:e "]" -> Enter[from,e]
              | '=>' expression:e -> Pass[from,e]
              | ':' '[' name ']' | ':' "" name "" | ':' name
              | ':' inline_host_expr:{Seq[Bind["it",from],Act[it]]}
              ):from )* -> from



Atomic expressions 



Various atomic expressions are handled by the following rules:

cases = className:klas ~'::'             -> Apply["clas",klas]
      | (number ('...'|'..') number
        | number)[]:num                  -> Apply["member",num.join]
      | '<' until('>'):s                 -> Apply["regch","/["+s+"]/"]

key   = '(|' expression:exp '|)'         -> Lambda[exp]
      | '@' className                    # Technical
      | '@' name:name argsOpt:arg        -> Key[name,arg]
      | '@@' name:name                   -> Global[name]
Semantic actions

Recognizing semantic actions depends on the host language. We do not have
to understand the whole grammar; recognizing matching pairs of delimiters suffices*.
An implementation specific to Ruby follows:

args(o,c) = seq(o) hostarg*:r seq(c) -> r

hostarg = key
        | args('(',')') | args('[',']') | args('{','}')
        | '\'' until('\'')
        | '"' interpolated
        | '#' line
        | < ()[]{}>

interpolated = ( '"' break
               | args('#{','}')
               | '\\'? .
               )*

inline_host_expr = args('{','}')
host_expr = inline_host_expr
          | '->' line:s {"{"+s+"}"}=>[ inline_host_expr ]



Note that the rule args is an example of parametrized application. Also note how
host_expr parses recursively.

Now we can put everything together to form term: 

term = cases 
call 

key: {Act [it] > 
host_expr 



expression: e "] " 

until ("" ):s 
' untilCX' ') :s 

line : s 
expression :x ") [] 
expression:?: ")" 



-> Apply ["anything"] 

-> Enter [Apply ["anything"] ,e] 

-> Apply ["token" ,quote(s)] 

-> Apply["seq" ,quote(s)] 

-> Comment [s] 

-[x}=>collect 

-> X 



* We use more complicated rules to recognize local variables.

2. Implementation

In this chapter we describe the main techniques used in the amethyst parser generator.
We introduce the novel notion of structured grammars and the formalism of relativized
regular expressions, which enable us to produce efficient top-down parsers for a wide
family of languages.

A top-down parsing implementation can be viewed as a bunch of mutually recursive
functions recognizing individual rules in the grammar description. Top-down parsers
are easy to implement and run fast for simple grammars.

But a naively implemented parser of the following rule:

R = "aa" R | "a" R

on "aaaaaaaaaaaaaaaaaaa..." can take exponential time.
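The blow-up can be observed with a small Ruby experiment. The recognizers r and r_memo and the call counter are hypothetical helpers written for this illustration. The memoized variant only hints at the packrat idea and is not the algorithm developed in this chapter.

```ruby
# Naive backtracking recognizer for  R = "aa" R | "a" R.
# `counter` records how often R is entered.
def r(s, pos, counter)
  counter[:calls] += 1
  (s[pos, 2] == "aa" && r(s, pos + 2, counter)) ||
    (s[pos, 1] == "a" && r(s, pos + 1, counter))
end

# The same rule with the result memoized per input position.
def r_memo(s, pos, memo, counter)
  return memo[pos] if memo.key?(pos)
  counter[:calls] += 1
  memo[pos] = (s[pos, 2] == "aa" && r_memo(s, pos + 2, memo, counter)) ||
              (s[pos, 1] == "a" && r_memo(s, pos + 1, memo, counter))
end

[10, 15, 20].each do |n|
  naive = { calls: 0 }
  memoized = { calls: 0 }
  r("a" * n, 0, naive)
  r_memo("a" * n, 0, {}, memoized)
  puts "n=#{n}: naive #{naive[:calls]} calls, memoized #{memoized[:calls]} calls"
end
```

The rule has no alternative that ends the recursion, so every attempt fails at the end of the input and the naive parser explores every way of splitting the input into pieces of length one and two. The number of calls therefore grows like the Fibonacci numbers, while the memoized variant enters R once per position.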

Incorporating left recursion also causes problems. A naive parser of

L = L a

would call L infinitely many times.

In natural language processing we typically want to enumerate possible interpre- 
tations of ambiguous grammar. 

Frost [13] gave 0{n'^) algorithm that outputs compact representation of all parses 
[36] and handles left recursion as recursive descend. Parsing expression grammars 
allow unlimited lookahead. Okhotin |29] suggest to extend context free grammars 
with lookahead to class of boolean grammars. Again his algorithm for boolean 
grammars had complexity O(n^). Both these algorithms were improved by variant 
of Valiant algorithm [H] to obtain complexity 0{M{n) logn) where M{n) is time 
of matrix multiplication. When boolean grammars are restricted to unambiguous 
boolean grammars there exists O(n^) algorithm. 

For programming languages ambiguity is undesirable. One approach is
parsing expression grammars, defined by Ford [12]. A parsing expression grammar
(PEG for short) can be viewed as a top-down parser with three
additional constraints. The first is that rules are deterministic. The second is
restricting the choice operator | to the ordered choice operator /: once an alternative
of an ordered choice succeeds, the choice succeeds and we do not backtrack into it if
something fails later. The third is that iteration is greedy and does not backtrack.

This definition without backtracking introduces the problem of prefix hiding: the
expression "a"/"ab" does not match the string "ab".
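Prefix hiding can be reproduced with a few lines of Ruby. The helpers peg_choice and bt_choice are hypothetical names. They only handle a choice between string literals, and the block passed to bt_choice plays the role of "the rest of the parse".

```ruby
# PEG-style ordered choice: commit to the first alternative that matches.
# Returns the position after the match, or nil.
def peg_choice(alts, s, pos)
  alts.each do |lit|
    return pos + lit.size if s[pos, lit.size] == lit
  end
  nil
end

# Backtracking choice: run the rest of the parse (the block) after every
# matching alternative until one of them lets the whole parse succeed.
def bt_choice(alts, s, pos)
  alts.any? do |lit|
    s[pos, lit.size] == lit && yield(pos + lit.size)
  end
end

s = "ab"
at_end = ->(p) { p == s.size }

peg_end = peg_choice(["a", "ab"], s, 0)
puts(peg_end && at_end.call(peg_end))                    # prints false
puts(bt_choice(["a", "ab"], s, 0) { |p| at_end.call(p) }) # prints true
```

The ordered choice stops after "a" and never reconsiders, so the requirement to reach the end of the input fails. The backtracking choice falls back to "ab".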

Seaton in his Katahdin language [53] uses a different, longest choice operator to
partially solve this problem. The longest choice tries all alternatives and
deterministically chooses the longest match. However, this does not eliminate prefix
hiding completely. The parser of:

"J'* "^f oo" (* is iteration operator) still does not match 'Lfoo". 

We take another approach. Programming languages use only two types of recur- 
sion: iteration and nested recursion. By making this information explicit we can 
generate linear time parsers that are equivalent to the fully backtracking ones. 
We present a new formalism of relativized regular expressions, REG^REG. Our
formalism relaxes the determinism of PEG grammars. As in PEG, we support arbitrary
lookaheads. Previous results can be easily derived using REG^^*-' formalism. 

Although REG^^^ seems stronger than PEG we show that PEG and REG^'^*^ 
are equivalent. 



2.1 Structured grammars 

We devise an approach to describing programming languages which we call structured
grammars. We build on an analogy with structured programming languages.

Just as programs used arbitrary goto constructs, grammars use arbitrary forms of
recursion. To make programs more readable, programming languages were extended
with structured control flow constructs, making it easier for developers to
read code locally without spending hours understanding the whole
context. We seek similar goals with the introduction of structured grammars.

Assume we are given a grammar for a fully backtracking top-down parser. We
say it is a structured grammar if it satisfies the following conditions:

1. Transparency of semantic actions. We can imagine that the parser is augmented
by an oracle that may decide that an alternative will eventually fail. The parser
should produce the same output regardless of whether we tried that alternative and
failed or used the hint from the oracle. Lookaheads form an important case. We
always revert actions made by lookaheads.

2. Recursion is restricted to iterative and nested recursion.

(a) Iterative: For example, the arguments of a function in C are lists of expressions
separated by ",". We typically use the iteration operator *. Iteration can
also be described by left-recursive or right-recursive rules. When
possible, iteration should be written in a way that is associative.

(b) Nested recursion: What is not iteration can be described by start and
end delimiters. We require the user to annotate this concept with the operator
nested(start, middle, end).

The simplest example is properly parenthesized expressions. They can be
described as:

exp = nested('(', (| exp |), ')')

We show two less trivial examples in the structured grammar formalism. A
while loop in C is matched by:

'while' exp nested('{', (| stmts |), '}')

Python uses indentation to describe nesting. We use a semantic predicate
to find where the block ends. We match Python while loops in amethyst as:

nested((| '\n' ' '*:x 'while' exp |), (| stmts |),
       (| &('\n' ' '*:y &{x.size>y.size}) |))
A nesting should satisfy three natural conditions.

2.1. The position of the end delimiter is determined by the position of the start delimiter.

2.2. When a nested starts at a smaller position, it should end at a strictly larger
position.

2.3. When both nested(start, mid1, end) and nested(start, mid2, end)
match a string, their end positions should agree.

Note that programming languages implicitly follow this convention. Other types
of recursion are undesirable because the user cannot reason about them locally.

One of the reasons is that programming languages were described as deterministic
context free grammars. Thus they can be recognized by a deterministic push-down
automaton. We can model a push/pop pair by calling nested. Indeed, if we did
not include lookaheads, our class would be equivalent to the class of deterministic
context free grammars.

Structured grammars offer additional advantages. For example, we can use the
structure information to semiautomatically construct an error correction tool.

For equivalence with the top-down parser our parsing algorithm needs condition 2.1.
Without condition 2.2 a parser would take quadratic instead of linear time. Condition
2.3 is a design guideline which is not needed by our algorithm.
2.2 PEG and REG^REG operators

In this chapter we use PEG operators as originally defined by Ford [12].

's'         Match string.
r           Rule application.
e1 e2       Sequencing.
e1/e2       Ordered choice.
e* e+       Iteration.
&e ~e       Positive and negative lookahead.
{a} &{a}    Semantic action and predicate.



We relax the determinism of PEG to obtain REG^REG expressions. We can describe every
structured grammar by REG^REG rules with a linear time guarantee. REG^REG
expressions mostly use the same operators as PEG. The difference is that the operators
backtrack, except for nested, which behaves deterministically.

nested(start,mid,end)    Nested operator.
e1|e2                    Prioritized choice.
e* e+                    Backtracking iteration.
e1[e2]                   Enter operator.



2.2.1 Simple algorithm 



We describe our parser in functional-style pseudocode written in
continuation passing style [15]. We denote a lambda as
\lambda(arguments){body} and call it with the call method.

We start with a simple implementation and progressively add more details.

A REG^REG parser behaves mostly as a top-down parser. We use the function
match(e,s,cont) where e is the expression we match, s is the current position and cont
is a continuation [31] represented as a lambda.



match(r,s,cont)   = match(body(r),s,cont)
match('c',s,cont) = if s.head=='c' ; cont.call(s.tail)
                    else           ; fail

match(e f,s,cont) = match(e,s,\(s2){ match(f,s2,cont) })
match(e|f,s,cont) = if match(e,s,cont) ; success
                    else               ; match(f,s,cont)

match(~e,s,cont)  = if match(e,s,\(s2){success}) ; fail
                    else                         ; cont.call(s)

match(e*,s,cont)  =
    cont2 <- \(s2){ if match(e,s2,cont2) ; success
                    else                 ; cont.call(s2)
                  }
    cont2.call(s)
The pseudocode above describes a naive top-down parser. For the REG^REG class we
restrict recursion and add the nested operator:

match(nested(st,mi,en),s,cont) =
    s3 <- match((st mi en),s,\(s2){success})
    if s3 ; cont.call(s3)
    else  ; fail
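The pseudocode translates almost directly into Ruby lambdas. The sketch below rests on several assumptions of ours: expressions are encoded as tagged arrays, positions are integer indices into the input string, success is any truthy value, and rule application is left out. As in the pseudocode, the iteration does not guard against bodies that match the empty string.

```ruby
# Expression encoding (an assumption of this sketch):
#   [:chr, "c"], [:seq, e, f], [:alt, e, f], [:not, e], [:star, e],
#   [:nested, st, mi, en]
# `s` is an index into `str`; `cont` is a lambda taking the new position.
def match(e, str, s, cont)
  case e[0]
  when :chr
    str[s] == e[1] && cont.call(s + 1)
  when :seq
    match(e[1], str, s, ->(s2) { match(e[2], str, s2, cont) })
  when :alt
    match(e[1], str, s, cont) || match(e[2], str, s, cont)
  when :not
    match(e[1], str, s, ->(_s2) { true }) ? false : cont.call(s)
  when :star
    cont2 = nil
    cont2 = ->(s2) { match(e[1], str, s2, cont2) || cont.call(s2) }
    cont2.call(s)
  when :nested
    # Record only the first end position, then continue from it
    # without ever backtracking into the nested expression.
    s3 = nil
    body = [:seq, e[1], [:seq, e[2], e[3]]]
    match(body, str, s, ->(s2) { s3 = s2; true })
    s3 ? cont.call(s3) : false
  end
end

input = "(aa)"
at_end = ->(p) { p == input.size }
expr = [:nested, [:chr, "("], [:star, [:chr, "a"]], [:chr, ")"]]
puts match(expr, input, 0, at_end) # prints true
```

All operators except nested pass the real continuation down, so a later failure makes them try other alternatives. The nested case passes a continuation that always succeeds, which is exactly what makes it commit to its first end position.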



2.2.2 Equivalence with top-down parsers and PEG 

We prove that for structured grammars the REG^REG parser finds the same derivation
as the fully backtracking one. As a top-down parser does not directly support left
recursion, we do not consider left recursion in this section.

The implementation of the fully backtracking parser is the same as the implementation
of the REG^REG parser in Section [2.2.1], except for nested:

match(nested(st,mi,en), s, cont) = match((st mi en), s, cont)

For the sake of the proof we rewrite the implementation of nested in the REG^REG parser
into an equivalent one. In nested we only consider the first alternative, as the
following pseudocode suggests:

match(nested(st,mi,en), s, cont) = first <- true
    match((st mi en), s, \(s2){
        if first ; first <- false
                 ; cont.call(s2)
        else     ; fail
    })

Equivalence with the top-down parser can be proved by an easy induction on the
nesting level.

1. When the expression contains no nesting, the implementations are identical.

2. Assume we have proved the proposition for nesting level i − 1. We prove level i by
a second induction on the number of nested calls in the continuation of level i.

(a) For a continuation that does not call nested we use the same argument as
in 1.

(b) Assume we have a continuation that calls nested n times. Consider the first
time we call nested. If this call fails, it also fails, by induction, in the
fully backtracking parser and we are done.

Otherwise both REG^REG and the fully backtracking parser first try the
lexicographically smallest alternative in the recursion tree. If the continuation
succeeds, the derivation is the same by induction.

If the continuation fails, we use assumption 2.1 of structured grammars.
Our parser does not try further alternatives. A backtracking parser
enumerates all derivations. As every derivation ends at the same
position, the continuation will always fail. Thus the backtracking parser
behaves like the REG^REG parser.
Just as not every C program is a structured program, not every REG^REG grammar is a
structured one. We can use nested with empty start and end to implement the
PEG operators. This gives us the inclusion PEG ⊆ REG^REG. The opposite inclusion
is true but not very enlightening. As there are only finitely many pairs (e, cont),
we can write, for each pair, a PEG rule that emulates the REG^REG algorithm.

For the linear time guarantee we still require every recursion except left and right
recursion to be annotated by nested.



2.3 Relativized regular machines 

To better understand the languages recognized by relativized regular
expressions we introduce relativized regular machines, which are similar to
nondeterministic finite state machines [SO]. We use this formalism as an
inspiration for an effective low-level implementation of parsers.

It is easy to see that a continuation corresponds to a syntactic right
congruence class. We use a representation that unifies identical expressions
and continuations. This can be viewed as NFA state minimization^1.

A relativized regular machine is similar to a nondeterministic finite state
machine. A machine can be described by a triple M = (S, t, a) where

S is a set of states,

t : (S, N, S) -> (M, S) is a set of transitions and

a ⊆ S* is a set of accepting states.

We have elementary machines that match a single character.

Transitions from a state s are done in the following way. We put
(M_i, s_i) = t((s, i, r_{i-1})), then recursively call machine M_i and, if it
succeeds, we move to its end position and set the state to s_i. Based on the
accepting state that this choice reaches we choose the next choice.



^1 NFA minimization is NP-hard in the general case. Our approach is a good
heuristic.

2.4 Effective implementation



The implementation above runs in linear time but its constant factor is quite
high. To obtain a better constant factor our parser generator applies various
optimizations. We use a low-level representation that is suitable for these
optimizations.

In this section we describe a parser that does not consider semantic actions.
Semantic actions will be added in the next section.

The representation of expressions is similar to a syntax tree. We use a
technique similar to the compact representation of derivations in the Tomita
algorithm [35]:

1. All nodes are immutable.

2. We represent all identical subtrees by a single object. When we are asked
to construct a node, the optimizer first tries to simplify the node by
algebraic identities. If after simplification we obtain a node identical to a
previously constructed node, we return the previously constructed node.

We will again use a function match(e, Args[...]) -> Result[...].

We will extend several times what the Args and Result objects contain.
Initially we define the following fields:

Args.s is the starting position in the string,
Result.s is the end position in the string,
Args.cont is a continuation.

The objects Args and Result have a method change that creates a new object
with appropriately changed fields.

2.4.1 Sequencing 

We represent the sequencing operator head tail by an object with pattern
Seq[head tail]. Representing sequencing in this way allows tail parts to be
shared. The implementation is straightforward.

match(Seq[head tail], a) = match(head, a.change(
    cont: \(a2){
        match(tail, a2.change(cont: a.cont))
    }))
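
The sequencing rule can be tried out in a few lines of Ruby. This is only a minimal sketch under assumptions: the names Args, Chr, Seq and the lambda done are hypothetical stand-ins chosen for the illustration, not amethyst's actual classes, and failure is modeled by the symbol :fail.

```ruby
# Hypothetical miniature of the Args/cont scheme; not amethyst's real classes.
Args = Struct.new(:str, :s, :cont) do
  # change returns a copy with the given fields replaced
  def change(fields)
    copy = dup
    fields.each { |k, v| copy[k] = v }
    copy
  end
end

Chr = Struct.new(:c)            # matches a single character
Seq = Struct.new(:head, :tail)  # head followed by tail

def match(e, a)
  case e
  when Chr
    if a.str[a.s] == e.c
      a.cont.call(a.change(s: a.s + 1))
    else
      :fail
    end
  when Seq
    # Match head; its continuation matches tail and then resumes a.cont.
    match(e.head, a.change(cont: lambda { |a2|
      match(e.tail, a2.change(cont: a.cont))
    }))
  end
end

done = lambda { |a| a.s }  # final continuation: report the end position
ab = Seq.new(Chr.new("a"), Chr.new("b"))
p match(ab, Args.new("abc", 0, done))  # prints 2
p match(ab, Args.new("acb", 0, done))  # prints :fail
```

Note that the inner continuation receives a2, the arguments after head has matched, so tail starts where head ended while the original continuation a.cont is restored.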

2.4.2 Choice and lookaheads 

Inspired by relativized regular machines we model choice and lookaheads by a
more general Switch operator. First we need to add a field Result.state. This
state will be used to pass information from rules to the Switch operator.

The Switch operator satisfies the following pattern:

Switch[ head alt:{state=>tail} merge ]

The Switch operator first matches head. Then it looks at which end state head
reached and matches the tail entry corresponding to that state. Finally it
computes the final state from the states of the children by the merge method.

For simplicity in this paper we use only two states, success and fail. We use
the identity function as the merge method. We also add success and fail rules
with the obvious implementation:

match(Rule[success]) = Result[state: success]
match(Rule[fail])    = Result[state: fail]

This is quite a general operator and we illustrate its uses on several
examples. The choice operator backtracks until the success state is reached.
An implementation is:

e1|e2 -> Switch[ e1 {success: success,
                     fail:    e2} ]

Lookaheads can be modeled like:

~e -> Switch[ Seq[e success]
              {success: fail,
               fail:    empty} ]

&e -> Switch[ Seq[e success]
              {success: empty,
               fail:    fail} ]
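
The three encodings above can be exercised with a small Ruby sketch. It is written under stated assumptions: all names (Result, Args, Chr, Seq, Switch, Success, Failure, Empty, choice, neg, pos) are hypothetical, there is no memoization, and the merge step is replaced by a simple rule of our own (a bare Success branch keeps the result of head), which is not necessarily what amethyst does.

```ruby
Result = Struct.new(:state, :s)
Args   = Struct.new(:str, :s, :cont)

Chr    = Struct.new(:c)
Seq    = Struct.new(:head, :tail)
Switch = Struct.new(:head, :alt)  # alt maps the end state of head to an expression
Success = Object.new  # rule that stops with state success
Failure = Object.new  # rule that stops with state fail
Empty   = Object.new  # matches the empty string and continues

def match(e, a)
  case e
  when Success then Result.new(:success, a.s)
  when Failure then Result.new(:fail, a.s)
  when Empty   then a.cont.call(a)
  when Chr
    if a.str[a.s] == e.c
      a.cont.call(Args.new(a.str, a.s + 1, a.cont))
    else
      Result.new(:fail, a.s)
    end
  when Seq
    match(e.head, Args.new(a.str, a.s, lambda { |a2|
      match(e.tail, Args.new(a2.str, a2.s, a.cont))
    }))
  when Switch
    r = match(e.head, a)
    branch = e.alt[r.state]
    # Assumed merge: a bare Success branch keeps the result of head.
    branch.equal?(Success) ? r : match(branch, a)
  end
end

def choice(e1, e2)  # e1|e2
  Switch.new(e1, { success: Success, fail: e2 })
end

def neg(e)          # ~e
  Switch.new(Seq.new(e, Success), { success: Failure, fail: Empty })
end

def pos(e)          # &e
  Switch.new(Seq.new(e, Success), { success: Empty, fail: Failure })
end

done = lambda { |a| Result.new(:success, a.s) }
expr = Seq.new(neg(Chr.new("a")), choice(Chr.new("b"), Chr.new("c")))
%w[a b c].each { |s| p [s, match(expr, Args.new(s, 0, done)).state] }
```

Because the head of a lookahead ends in the success rule, it never calls the continuation; only the empty branch resumes parsing, from the original position.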

A Switch makes optimizations easy. Switches can be easily composed. To compose
switches A and B the simplest way is to use states that are pairs (state from
A, state from B). We need to define the merge method to compute the final
state. We can represent these pairs compactly as a bit vector. Another
optimization is predication. When we know the first character we can simplify
the expression:

Switch[ Result[first_character]
        { 'a': expressions that can start by a,
          'b': expressions that can start by b,
          ... } ]

For a choice e1 | e2 we can, based on the result of the partial match of e1,
simplify the matching of e2. For example, consider the expression:

  (a|b) c (d|f)
| (b|c) c f

on the string "bcd".

When the first alternative matches "d" we know that the second alternative
will not match. The last choice could pass a state to inform the first choice
about this condition.

An implementation of Switch is the following. We hide technical details in the
merge method. For details see our implementation [5].

match_memo(e, a) =
    if memo[e, a] ; memo[e, a]
    else          ; memo[e, a] <- match(e, a)

match(Switch[ head alt merge ], a) =
    r  <- match_memo(head, a)
    r2 <- match_memo(alt[r.state], a)
    merge(r, r2)

2.4.3 Iteration 

We use the low-level repeat-until and Stop operators from the previous chapter
to represent iteration.

Repeat-until can terminate if and only if we encountered the corresponding
Stop in the current iteration. We add a stops field to Args to collect
encountered stops.

This allows us to describe normal iteration e* and eager iteration e*? as
(e | Stop)** and (Stop | e)** respectively. Repeat-until is equivalent to
right recursion. For example, we can flip between the rules

R = a R | b | c R | d

and

R = (a | b Stop | c | d Stop)*.

Except for the stop condition the implementation is nearly identical to the
implementation of the * operator from Section 2.2.1.

match(Stop[st], a) = a.cont.call(a.change(stops: a.stops+st))
match(Many[st e], a) =
    cont2 <- \(a2){
        if a2.stops & st ; a.cont.call(a2.change(stops: a2.stops-st))
        else             ; match(e, a2.change(cont: cont2))
    }
    cont2.call(a)

2.4.4 Rule call 

A rule call only affects the scope of variables. When no semantic actions are
present we can freely move an expression to a separate rule and back.

match(Rule[e], a) = match(e, a)

For nested we use a similar implementation as before.

match(Nested[st mi en], a) =
    r = match_memo(Seq[st mi en], Args[s: a.s, cont: \(m){success}])
    if r.state==success ; a.cont.call(a.change(s: r.s))
    else                ; fail



2.5 Semantic actions



We add semantic actions as explained in the previous chapter.

While semantic actions are easy to add, they complicate other parts of the
parser.

We add the following fields:

Args.closure      closure for semantic actions,
Args.returned     the result of the last expression,
Result.returned   the returned result.

We model a semantic act as a function that modifies arguments. For simplicity
we model variable binding by a semantic act.

match(Act[f], a) = a2 <- f.call(a.closure)
                   a.cont.call(a.change(a2))

Now we are ready to add the enter operator.

match(Enter[e1 e2], a) =
    match(e1, a.change(cont: \(a2){
        match(e2, a2.change(s: a2.returned, cont: \(a3){
            a.cont.call(a3.change(s: a.s))
        }))
    }))

Semantic actions in a rule invocation have a shared scope. We use a closure
object to achieve this. A rule invocation becomes:

match(Rule[e], a) = match(e,
    a.change(closure: new_closure,
             cont: \(a2){ a.cont.call(a.change(s: a2.s,
                                               returned: a2.returned)) }))

We also add semantic predicates from the previous chapter. This complicates
memoization and we, for simplicity, disable memoization when a semantic
predicate is present.

Support for parametrized rules and lambdas is a bit technical to add. For a
parametrized rule we first model arguments by a semantic act bound to argument
variables. Then we add a field consisting of pairs (argument variable,
parameter variable) and we initialize the new closure according to these
pairs. For a lambda we bind an (expression, closure) pair to the corresponding
variable. We disable memoization when a parametrized rule is present for the
same reasons as with a semantic predicate.

Memoization becomes more technical. The simplest way to get linear time
complexity is to use a two-pass parser which in the first pass runs the parser
from Section 2.4 and in the second pass just constructs the parse tree. We
refine this idea and run both phases in parallel. We use a functor
forget_semantic_actions:

match_memo_state(e, a) =
    if (has_predicate(e) | has_predicate(a)) ; match(e, a)
    else ; e2 <- forget_semantic_actions(e)
           a2 <- forget_semantic_actions(a)
           if memo[e2, a2] ; memo[e2, a2]
           else            ; memo[e2, a2] <- match(e, a)

A simple implementation of Switch can be:

match(Switch[ head alts merge ], a) =
    r  <- match_memo_state(head, a)
    r2 <- match_memo_state(alts[r.state], a)
    if r2.state==fail
        fail(r2)
    else merge(match(head, a),
               match(alts[r.state], a))

Sometimes Switch knows that the result is not needed. Then we can directly
call the expression simplified by forget_semantic_actions. This always happens
for lookaheads.



2.5.1 Time complexity



Ford [12] rewrites iteration to recursion to obtain linear time complexity.
However, most implementations naively use a loop.

It is possible to construct test cases where an arbitrary number (say k) of
loops are nested together and each fails at the end of the input. This leads
to time complexity at least n^k for arbitrary k. This can be seen on the
following expression:

( ( ( ( 'a' )* 'b'
/ 'a' )* 'c'
/ 'a' )* 'd'
/ 'a' )* 'e'

on "aaaaaaaaaaaaaaaaaaaaaaa...".

We memoize continuations precisely for this reason.
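
The blow-up can be observed with a deliberately naive matcher. The Ruby sketch below is an illustration under assumptions, not any existing PEG implementation: expressions are nested arrays, iteration is a plain loop, nothing is memoized, and the helper names peg, nested_loops and steps are hypothetical. It counts how many times the matcher is entered.

```ruby
$steps = 0

# Naive PEG matcher: returns the end position or nil on failure.
def peg(e, str, pos)
  $steps += 1
  case e[0]
  when :chr
    str[pos] == e[1] ? pos + 1 : nil
  when :seq
    e[1..-1].each do |x|
      pos = peg(x, str, pos)
      return nil unless pos
    end
    pos
  when :alt
    e[1..-1].each do |x|
      r = peg(x, str, pos)
      return r if r
    end
    nil
  when :star
    # plain loop, as used by most implementations
    while (r = peg(e[1], str, pos)) && r > pos
      pos = r
    end
    pos
  end
end

# Builds the expression from the text with k nested loops (k = 4 above).
def nested_loops(k)
  e = [:chr, "a"]
  (1...k).each do |i|
    e = [:alt, [:seq, [:star, e], [:chr, ("a".ord + i).chr]], [:chr, "a"]]
  end
  [:seq, [:star, e], [:chr, ("a".ord + k).chr]]
end

def steps(k, n)
  $steps = 0
  peg(nested_loops(k), "a" * n, 0)
  $steps
end

[25, 50, 100].each do |n|
  puts "n=#{n}: k=2 takes #{steps(2, n)} steps, k=3 takes #{steps(3, n)} steps"
end
```

Doubling n should roughly quadruple the count for k = 2 and multiply it by close to eight for k = 3, in line with the n^k bound.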

For the parser from Section 2.4 there are only finitely many expressions and
continuations. Thus there are only O(n) memoization pairs (e,a).

With semantic actions we sometimes need to recalculate the result. For a given
pair (nested, position) we need to recalculate the result of every (e,a) pair
at most once. For general REG^^*^ expressions time complexity O(n^2) follows.

For structured grammars this behavior cannot happen. We do not have to
recalculate when the result state is fail or when we match in lookaheads. What
is left is that we could have two invocations of the same nested expression at
two different positions that recalculate the same (e,a) pair. But this would
mean that both invocations are accepted with the same end position, which
contradicts condition 2.2 of structured grammars. Consequently the parser of
structured grammars runs in linear time.

With semantic predicates we cannot give any complexity guarantee. To integrate
them correctly we disable memoization when the continuation contains a
semantic predicate.



2.6 Memory consumption of REG^^^ parsers 

Mizushima et al. [25] propose a way to decrease the memory usage. We describe
a similar but simpler approach.

The parser implementation maintains the set of live branches in a list live.
The list is maintained in the following way:

• When the parser descends into a choice operator, its branches are added to
the live list.

• When the parser descends into a branch, the branch is removed from the live
list.

• When the parser encounters a cut, the branches that were cut are removed
from the live list.

When the live list is empty we know that subsequent parsing cannot return to a
position smaller than the current one. We can safely delete all memo entries
with a smaller position. One can observe that the live list itself is not
needed. The implementation can be further simplified by only keeping track of
the size of the list in a counter alternatives.

The parser then deletes stale entries from the memo table lazily. It keeps
track of the rightmost position where alternatives was zero. At the time a
table expansion is needed, all earlier entries are deleted. This avoids the
need for the expansion if the table after deletion is at most half full.

Note that if we want to incorporate destructive semantic actions we can in the
same way defer their evaluation until alternatives is zero.

For practical grammars this extension gives nearly constant memory usage.
However, we can construct examples where this approach does not help. For
example, in the expression

exp* 'x' | exp* 'y'

we need to keep memoization entries until the end is reached.
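
The counter-based bookkeeping can be sketched in Ruby as follows. This is a simplified illustration under assumptions: the class PrunedMemo and its method names are hypothetical, the memo table is an ordinary Hash keyed by [rule, position], and prune is called explicitly instead of being tied to table expansion.

```ruby
class PrunedMemo
  attr_reader :table

  def initialize
    @table = {}        # [rule, pos] => result
    @alternatives = 0  # number of live (not yet tried) branches
    @safe_pos = 0      # rightmost position seen while no branch was live
  end

  def enter_choice(branches)
    @alternatives += branches
  end

  def enter_branch
    @alternatives -= 1
  end

  def cut(removed)
    @alternatives -= removed
  end

  def store(rule, pos, result)
    @table[[rule, pos]] = result
  end

  def fetch(rule, pos)
    @table[[rule, pos]]
  end

  # Called when the parser reaches pos; remembers where pruning is safe.
  def advance(pos)
    @safe_pos = pos if @alternatives.zero? && pos > @safe_pos
  end

  # Lazy deletion of entries the parser can never return to.
  def prune
    @table.delete_if { |key, _result| key[1] < @safe_pos }
  end
end

m = PrunedMemo.new
m.store(:expr, 0, :ok)
m.enter_choice(2)
m.enter_branch                 # one alternative is still live
m.advance(5)
m.prune                        # nothing is deleted: we may still backtrack
m.enter_branch                 # the last branch is taken
m.store(:expr, 7, :ok)
m.advance(7)
m.prune
p m.table.keys                 # only the entry at position 7 remains
```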



Memoization in general setting

Memoization is viewed as an alternative to dynamic programming. A naive
memoization can have big memory consumption. We show that with a few simple
tricks we can obtain better performance with an order of magnitude smaller
memory usage. Memoization strategies and automatic memoization were,
unsurprisingly, developed in the context of context-free parsing [27]. We use
a memoization strategy that reduces the memory consumption of memoization in
general.

The first step to reduce memory usage is to save values for parameters in a
hash table. This gives memory consumption proportional to the number of saved
values.

When we do not ask the question "What functions must be memoized?" but the
right one, "What parameters must be memoized?", the solution is surprisingly
simple: we only have to memoize those parameters that took at least, say, 512
cycles to compute.

There are three additional improvements:

• We can count only cycles that were not memoized by a son rule.

• We keep an additional small (say 512-element) directly mapped write-back
cache. We use this when we notice that values below our threshold are
typically rarely reused after, say, 100 steps.

• Some functions never reach our threshold, so we can use a separate table to
save unnecessary lookups. We are more worried that these lookups thrash the
cache than about their direct cost.

This has the same asymptotic time complexity as when we memoize everything,
because a rule is memoized at most once and when we do not memoize the time
complexity is constant.

There is a technical problem of how to measure the number of cycles. While x86
offers a timestamp counter, on Core 2 reading it takes about 24 cycles. We
currently use a simple estimate obtained by counting the number of functions
that we called, as we typically have constant overhead. An alternative way is
to first run a version that gathers profiling data and then use estimates from
that version.

We illustrate our ideas on the following memoized version of Fibonacci numbers
in C language:

typedef struct {long time; long saved; long result;} time_struct;
time_struct timestamp;

struct {int key; long value;} cache[512];
long memo[1000000];

/* Look the key up in the small cache first, then in the memo table. */
time_struct memoize_start(int key){
    if (cache[key%512].key==key){
        timestamp.result=cache[key%512].value;
        return timestamp;
    } else if (memo[key]) {
        timestamp.result=memo[key];
        return timestamp;
    } else {
        timestamp.result=0;
        return timestamp;
    }
}

void memoize_ended(time_struct started, int key, long value) {
    long time = timestamp.time-started.time;
    long saved = timestamp.saved-started.saved;
    long time_self = time - saved; /* work not already memoized by callees */

    cache[key%512].key=key; cache[key%512].value=value;
    if (time_self > 128){
        memo[key] = value;
        timestamp.saved += time_self;
    }
}

long fib(int n){
    time_struct started=memoize_start(n);
    timestamp.time++;
    if (started.result) return started.result;
    if (n<2) return 1;
    long result=fib(n-1)+fib(n-2);
    memoize_ended(started, n, result);
    return result;
}






2.7 From REG^^^ back to REG 



We establish a reg functor. We use it to analyze REG expressions.

The reg functor assigns to each relativized regular expression e a regular
expression reg(e). The expression reg(e) satisfies the approximation condition
that if e accepts s then reg(e) accepts s, but the converse is not necessarily
true.

We can extract useful information by testing whether the intersection with a
suitable regular language is empty.

empty(e)           = reg(e) ∩ reg('')
first_char('c', e) = reg(e) ∩ reg('c' .*)
overlap(e1,e2)     = reg(e1 .*) ∩ reg(e2 .*)

If overlap(e1,e2) does not match anything then we can freely flip between
e1|e2 and e2|e1. Also note that if this occurs then the choice is
deterministic and we do not have to backtrack if the first alternative
succeeds.

Mizushima [25] also transforms a grammar to a more deterministic one. We use a
stronger analysis. Using overlap we can determine where we can insert return
states that inform Switch that the next alternatives cannot occur^1.

While the bounds minsize(e), maxsize(e) on the minimal and maximal sizes of a
string that matches e can be discovered by intersecting with suitable
languages, it is faster to compute them by dataflow analysis.
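
A bottom-up computation of such bounds can look as follows. The Ruby sketch is an assumption-laden illustration: expressions are nested arrays with hypothetical tags, rule calls (and therefore recursion, which would need a fixpoint iteration) are left out, the middle of nested is treated as arbitrary text, and a lookahead only inherits the bounds of the expression that follows it.

```ruby
INF = Float::INFINITY

# Expression forms used here (hypothetical encoding):
#   [:chr, c]  [:seq, a, b]  [:alt, a, b]  [:star, a]
#   [:nested, start, mid, end]  [:and, a, b]   # the lookahead form &a b
# Returns [minsize, maxsize].
def size_bounds(e)
  case e[0]
  when :chr
    [1, 1]
  when :seq
    a = size_bounds(e[1])
    b = size_bounds(e[2])
    [a[0] + b[0], a[1] + b[1]]
  when :alt
    a = size_bounds(e[1])
    b = size_bounds(e[2])
    [[a[0], b[0]].min, [a[1], b[1]].max]
  when :star
    [0, INF]                 # conservative upper bound
  when :nested
    s = size_bounds(e[1])
    t = size_bounds(e[3])
    [s[0] + t[0], INF]       # the middle may be practically anything
  when :and
    size_bounds(e[2])        # conservative: ignore the lookahead itself
  end
end

a, b, c = [:chr, "a"], [:chr, "b"], [:chr, "c"]
p size_bounds([:seq, a, [:alt, b, [:seq, b, c]]])   # prints [2, 3]
p size_bounds([:nested, a, [:star, b], c])          # prints [2, Infinity]
```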

The functor reg can be defined in the following way:

reg( 'c' )                     = c
reg( r )                       = reg(r)
reg( a* )                      = reg(a)*
reg( nested(start, mid, end) ) = reg(start) .* reg(end)
reg( a b )                     = reg(a) reg(b)
reg( a|b )                     = reg(a) | reg(b)
reg( &a b )                    = reg(a) ∩ reg(b)

We use a rough approximation of the middle of nested. In the typical case the
inside of a nesting could be practically anything, so trying to improve this
approximation leads only to larger expressions without any new insights.

We remark that a better result can be obtained by first using a relativized
regular machine and then converting it to a regular machine. This gives two
advantages. The first is that Switch also describes lookaheads and we can
describe intersection by lookahead.

The second is that we can use the following facts:

If A is unambiguous then AB ∩ AC = A(B ∩ C).
If A is unambiguous then AB | AC = A(B | C).

As there are only finitely many (continuation, cuts, stops) triples, the size
of our machine is finite.



^1 We can also consider continuations for better results.






2.8 Problems of left recursion 



Left recursion handling deserves a topic of its own. Various approaches were
suggested and various counterexamples found.

In PEG implementing left recursion correctly is an impossible task. Consider
the rule:

L = &( L 'cd'  ) 'abc'   # a -> abc -> abcbc -> ab
  | &( L 'bcd' ) 'ab'    #       ^               |
  | L 'bc'               #       |               v
  | L 'cb'               #     abcbcb   <-    abcb
  | 'a'

on "abcbcbcd".

It creates an infinite cycle in the recursion. This problem is more
fundamental, as there is a paradox:

L = ~L

We reject such self references and raise an error when a lookahead refers to a
possibly indirectly left recursive rule. Note that in boolean grammars the
same problem was recognized [29].

Left recursion can be handled by recursive descent or recursive ascent. A
rule

L = L 'bc' | L 'c' | 'ab' | 'a'

on "abc" is recognized as "((a)bc)" by a recursive descent parser but as
"((ab)c)" by a recursive ascent one. All previous approaches in PEG and
context-free bottom-up parsers used the recursive ascent variant of left
recursion. The simplest algorithm is attributed to Paull [1]. It consists of
rewriting direct left recursion

L = L a | b | L c | d

to the equivalent rule

L = (b | d) (a | c)*.

An indirect left recursion is removed by inlining and thus reduced to the
direct recursion case.
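
The rewriting of direct left recursion is mechanical, as the following Ruby sketch shows. It is a minimal illustration under assumptions: a rule is given as a list of alternatives, each a list of symbol names, the function name paull is hypothetical, and the sketch expects at least one alternative that is not left recursive.

```ruby
# Rewrites  L = L a | b | L c | d  to  L = (b | d) (a | c)*.
# Returns the rewritten rule as a string, or nil when nothing is left recursive.
def paull(rule, alts)
  recursive, base = alts.partition { |alt| alt.first == rule }
  return nil if recursive.empty?
  tails = recursive.map { |alt| alt.drop(1) }  # strip the leading L
  fmt = lambda { |xs| "(" + xs.map { |x| x.join(" ") }.join(" | ") + ")" }
  "#{rule} = #{fmt.call(base)} #{fmt.call(tails)}*"
end

puts paull("L", [%w[L a], %w[b], %w[L c], %w[d]])
# prints: L = (b | d) (a | c)*
```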

In 1965 Kuno [21] suggested limiting the recursion depth by n. This was
rejected in the PEG setting because in the presence of semantic predicates
some recursive rules need more than n calls. Also it was not clear how to
handle infinite streams. But it was rejected prematurely.

Using the reg functor (or simple dataflow) we can for each expression compute
a lower bound on the minimal length of a string that matches that expression.
Using this information we can easily estimate the minimal size of the current
continuation. When this bound exceeds the length of our string we can fail.

For infinite streams we can guess the bound by guessing initially 1 and
doubling the bound when recursion could continue. We do not use this approach
as it has exponential complexity in the worst case.

Note that the same technique can improve Frost's algorithm [13].

In the packrat setting Ford used Paull's algorithm to remove direct left
recursion. He declined to support left recursion for the following reason
[12]:

"At least until left recursion in TDPL is studied further, utilizing such a
feature would amount to opening a syntactic Pandora's Box, which clearly
defeats the pragmatic purpose for which the simple left recursion
transformation is provided."

Warth, Douglass and Millstein |12] attempted to add runtime detection of left
recursion. With a bit of imagination it could be interpreted as doing Paull's
algorithm at runtime. However, this approach has several flaws.

One, discovered by Tratt [ID], is that seed growing introduces ambiguity of
direct left recursion when a right recursive alternative is also present.

A revised algorithm of Tratt still contains a flaw. Tratt at certain times
forbids expansion of right recursion.

Tratt's approach fails to handle right-recursive lookahead as the following
counterexample shows.



A third issue was discovered by Peter Goodman [13]. Warth's algorithm does not
handle the following grammar.

A = A 'a' / B
B = B 'b' / A / C
C = C 'c' / B / 'd'

Medeiros in an unpublished paper [23] devised a revised version of the seed
growing algorithm.

One of the possible advantages of seed growing could be support of higher
order parametrized rules. In the amethyst parser most higher order rules are
inlined, making this point moot.

2.8.1 Left recursion in parser 

We combine two techniques. First we just rewrite recursion by Paull's
algorithm. The second technique is that continuation passing style does
implicit finite state machine minimization. This is simpler and leads to
smaller grammars than Moore's left corner transform heuristic [26].

We handle left recursion inside iteration by unrolling one level.

With some bookkeeping we can transform left recursion to recursive descent.
The idea is that each alternative returns its derivation and we choose the
lexicographically smallest one in the recursion tree. This can be done in O(1)
time using dynamic lowest common ancestor [TUj.

L = L 'a'
  | ~('b' L) 'b'
  | 'c'



2.9 State handling

We show how to implement the techniques for state handling that we described
in Section 1.14.

We use a simple memoizing top-down PEG parser implemented in Ruby as an
example. The implementation in amethyst uses similar ideas but intermixed with
the handling of other features.

class Match
  ["src", "pos", "result", "contextual_arguments",
   "contextual_returns", "locals"].each{ |name|
    eval "
      def #{name}(); @hash[\"#{name}\"]; end
      def #{name}=(v); @hash[\"#{name}\"]=v; end
    "
  }

  def timestamp; deep_clone(@hash); end
  def revert(ts); @hash=ts; end

  def memo_id; [src,pos,deep_clone(contextual_arguments)]; end
  attr_accessor :memoized #we don't revert memo table
end



Instead of deep cloning one could track the changes we made and revert them.
This has the same time complexity as without persistence because every revert
is paid for by the corresponding addition.
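
The change-tracking alternative can be sketched as an undo journal. The Ruby class below is a hypothetical illustration, not part of amethyst: a timestamp is simply the current length of the journal, and revert undoes entries until that length is reached again. Only assignments made through []= are journaled; objects mutated in place are not tracked.

```ruby
class JournaledState
  def initialize
    @hash = {}
    @log  = []  # entries of the form [key, had_key, old_value]
  end

  def [](key)
    @hash[key]
  end

  # Every assignment records how to undo itself.
  def []=(key, value)
    @log << [key, @hash.key?(key), @hash[key]]
    @hash[key] = value
  end

  # A timestamp is the current length of the journal.
  def timestamp
    @log.size
  end

  # Undo all assignments made since the timestamp was taken.
  def revert(ts)
    while @log.size > ts
      key, had_key, old = @log.pop
      if had_key
        @hash[key] = old
      else
        @hash.delete(key)
      end
    end
  end
end

state = JournaledState.new
state[:pos] = 0
ts = state.timestamp
state[:pos] = 5
state[:result] = "x"
state.revert(ts)
p state[:pos]     # prints 0
p state[:result]  # prints nil
```

Each revert step pops one journal entry, so the total cost of reverting is bounded by the number of assignments performed.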



# deep_clone, deep_freeze and $rules are assumed to be defined elsewhere
def match(exp)
  case exp
  when Call
    id=memo_id + [exp.name]
    if !memoized[id]
      ts=timestamp
      self.locals, self.contextual_returns = {}, {}
      self.result=match($rules[exp.name])
      memoized[id]=deep_freeze(clone)
      revert(ts)
    end
    c=memoized[id]
    self.result, self.pos=c.result, c.pos
    c.contextual_returns.each{ |k,v|
      contextual_returns[k] << v
    }
  when Char
    if src[pos]==exp.char
      self.result=exp.char; self.pos+=1
    else
      self.result=:fail
    end
  when Seq
    exp.each{ |seq|
      if match(seq) == :fail; return :fail
      end
    }
  when Or
    exp.each{ |alt|
      ts = timestamp
      if match(alt) != :fail; return result
      else                  ; revert(ts)
      end
    }
    return :fail
  when Act
    self.result=exp.call(self)
  when Bind
    locals[exp.name]=result
  end
  return result
end



3. Dynamic parsing



A normal parser processes files in batch fashion. Amethyst allows dynamic
parsing, where the user is allowed to add and delete characters from the
string and query the current parser output.

Editors and IDEs try to maintain syntax highlighting and error detection,
often in an ad-hoc way. Syntax highlighting typically uses regular expressions
to determine the meaning of the edited text. This yields several problems. One
is that regular expressions can be confused by certain inputs. Another is
updating the regular expressions for new versions of the grammar. Dynamic
parsing solves these problems in a robust way.

In this chapter we develop a generic way to transform a memoizing top-down
parser into a dynamic one. The update operation of the dynamic parser has
worst case time complexity O(r ln n), where r is the number of rules that need
to be recomputed. For typical workloads the running time is O(r). Our
techniques can be applied to packrat parsers, the parsing algorithm of Frost,
and the REG^^*^ parser.

The main idea of our approach is to annotate each memoized rule with the
interval of input used to calculate it. We use a balanced tree to detect
whether this interval changed or not. After receiving an update to the input
we update the version of the corresponding element. When we need a memoized
rule we check if there was a change in its interval and recalculate it as
necessary.
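
The interval check can be illustrated by a deliberately naive Ruby sketch. It rests on several assumptions: the names Node and DynamicInput are hypothetical, a plain array replaces the balanced tree (so every query is linear instead of logarithmic), only insertion is supported, and the interval is stored as a starting node plus a length so that it survives edits made in front of it.

```ruby
Node = Struct.new(:value, :ts, :memo)

class DynamicInput
  def initialize(str)
    @clock = 0
    @nodes = str.chars.map { |c| Node.new(c, @clock, {}) }
  end

  def node(i)
    @nodes[i]
  end

  # User edit: the new character carries the current timestamp.
  def ins(p, c)
    @nodes.insert(p, Node.new(c, @clock, {}))
  end

  # Called after each parse so that later edits get a newer timestamp.
  def tick
    @clock += 1
  end

  # Maximal timestamp among length characters starting at node first.
  def timestamp(first, length)
    i = @nodes.index { |n| n.equal?(first) }
    @nodes[i, length].map(&:ts).max
  end

  def set_memo(rule, first, length, value)
    first.memo[rule] = [length, timestamp(first, length), value]
  end

  # Returns the memoized value, or nil when the interval was edited.
  def get_memo(rule, first)
    length, saved, value = first.memo[rule]
    return nil unless length
    timestamp(first, length) == saved ? value : nil
  end
end

input = DynamicInput.new("1+2*3")
first = input.node(2)                   # a rule memoized over "2*3"
input.set_memo(:term, first, 3, :tree)
input.tick                              # the parse is finished
input.ins(0, "(")                       # edit outside the interval
p input.get_memo(:term, first)          # prints :tree
input.ins(4, "0")                       # edit inside the interval
p input.get_memo(:term, first)          # prints nil
```

Storing the memo entry on the node itself mirrors the pointer-based interface: the entry stays attached to its character no matter how many characters are inserted before it.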



3.1 Interface to memoizing top-down parsers 

All three algorithms (REG^^*^, PEG, Frost's algorithm, ...) allow separation of
memoization into an independent module. In dynamic parsing this module serves
as an intermediary interface between the user (IDE, editor, ...) and the
parser.

Our parser can be easily generalized to matching arrays of arbitrary type with
no modifications to the algorithm.

The user interface consists of the following four methods:

chr(p)     Character at position p
ins(p,c)   Insert character c at position p
del(p)     Delete character at position p
parse      Return result of parser

The interface with the parser is more interesting. A parser can access the
string only through pointers that always point to the same character
regardless of modifications.

char(ptr)                  Value of current character.
next(ptr)                  Next character.
get_memo(rule,ptr)         Returns memoized value.
set_memo(rule,ptr,value)   Memoizes given value.

We assume that the parser calls get_memo on entering a rule and set_memo on
exiting a rule. The parser does not have to call set_memo if it decides not to
memoize a rule.



3.2 Data structure 

We present data structures for the O(r ln n) time bound.

Our structure is a balanced tree. We maintain several properties in each node:

value      value of character
sons       number of sons in subtree
ts         timestamp
maxts      maximal timestamp of son nodes
memoized   memoization entries starting at this position

We use a timestamp that increases after each call of parse. Each
insertion/deletion gets assigned this timestamp.

Our implementation needs several auxiliary methods:

timestamp(first, last)   maximal timestamp in the interval specified by first
                         and last.
index(ptr)               position of character in current string.
rindex(n)                pointer to n-th character in current string.

Writing a balanced tree that supports the chr, ins, del, timestamp, index and
rindex methods in O(ln n) while maintaining the properties above is a typical
homework exercise. Note that the queries we make exhibit spatial locality.
Thus a splay tree [37] looks like a good candidate to obtain O(1) running time
in practice.

Implementing our data structure as a tree with a node for each character is
impractical. Observe that the order in which we modify the string between
calls to the parse method is not important. We can change our data structure
to a rope data structure [6] with the property that leaf substrings have the
same timestamp. When splitting a substring we also need to update table
entries. As a character can participate in at most O(ln n) splits, the
amortized O(ln n) time complexity still holds.

There is a technical problem with deletion. We need to save somewhere that a
deletion occurred. We keep nodes that contain no character for this purpose.
Luckily we can always merge two adjacent empty nodes into one. It is easy to
see that the number of empty nodes will be at most the number of nonempty
ones.



3.3 Algorithm



We will present two implementations. The first implementation is an extension
of amethyst that overrides the '.' operator. The second is in pseudocode that
serves as an overview of the effective C implementation from the lib/dynamic
subdirectory of the amethyst project.

amethyst Dynamic {
  init = { @@memoized={} ; @@rightmost=0 }
  if(x) = &{x}

  memo(rule) =
      {id(rule,src,pos)}:id
      {pos}:oldpos
      {@@rightmost}:oldright
      {pos}:@@rightmost
      ( if(!memoized[id])
        (apply(rule) | {"failed"}):result
        { memoized[id]=Memo[pos-oldpos,@@rightmost-oldpos,result] }
      | -> nil
      )
      { pos=oldpos+memoized[id].advance
        @@rightmost=max(oldright,pos+memoized[id].radvance)
      }
      ( if(memoized[id].result=="failed") fail
      | -> memoized[id].result
      )

  anything = if(pos>=len) fail
           | {@@rightmost=max(@@rightmost,pos);pos=pos+1} -> src[pos-1]
  #seq is analogous
}

class Memo
  attr_accessor :advance, :radvance, :result
  def self.[](advance, radvance, result)
    m=Memo.new
    m.advance, m.radvance, m.result=advance, radvance, result
    m
  end
end

Our algorithm maintains a stack that mirrors the call stack of the parser. For
each rule we find the rightmost position that can affect the result of the
rule.

stack_struct stack
stack_push(rule, ptr) {
    stack.push
    stack.top.rule=rule
    stack.top.ptr =ptr
    stack.top.last=ptr
}

stack_pop(rule, ptr) {
    last=stack.top.last
    stack.pop
    update_last(last)
}

update_last(ptr) {
    if (index(ptr)>index(stack.top.last))
        stack.top.last=ptr
}

There is a technical issue that saving the rightmost position as a pointer is
unwieldy. Instead we represent the rightmost position as the number of
characters from the starting position. With this improvement we, for example,
do not have to worry about what happens if the rightmost position is deleted.

char(ptr) {
    update_last(ptr)
    return ptr.value;
}

get_memo(rule, ptr) {
    if (ptr.memoized[rule]) {
        last=rindex(ptr.index+ptr.memoized[rule].advance)
        if (timestamp(ptr, last)==ptr.memoized[rule].saved) {
            update_last(last)
            return ptr.memoized[rule].value
        }
    }
    stack_push(rule, ptr)
    return nil
}

set_memo(rule, ptr, value) {
    while (stack.top.rule!=rule ||
           stack.top.ptr!=ptr)
        stack_pop //parser decided not to memoize
    ptr.memoized[rule].value   = value
    ptr.memoized[rule].advance = stack.top.last.index
                                 - ptr.index
    ptr.memoized[rule].saved   = timestamp(ptr, stack.top.last)
    stack_pop
}



4. Peridot



We use Peridot as an example of using amethyst in language design.



4.1 Basic concepts

Peridot does not differ much from mainstream dynamic programming languages.
We assume that the reader is familiar with concepts like class, method and
dynamic dispatch.

Currently Peridot has classes for integers, arrays and strings with basic
methods and operators. For their description see the Peridot documentation.

As in Ruby, variables are defined by assignment.



4.2 Peridot grammar in amethyst 



We use a parts of Peridot grammar to illustrate use of amethyst. Entire grammar 
can be found in Appendix O 

We try to design an operator precedence that avoids pitfalls of C language that 
expressions 1 + 1<<2 == 5 and 1&2 == 5&2 are false. 

binary_op(exp,oper) = apply(exp):a ("" apply(oper):op
                      apply(exp):b {call(op,a,b)}:a)* -> a

expr_or_l  = binary_op('expr_and_l', (| "||" |))
expr_and_l = binary_op('expr_cmp',   (| "&&" |))
expr_cmp   = binary_op('expr_ar1',   (| '<' | '<=' | '<=>'
                                      | '>=' | '>' | '==' | '!=' |))
expr_ar1   = binary_op('expr_ar2',   (| '+' | '-' |))
expr_ar2   = binary_op('expr_or',    (| '*' | '/' | '%' |))
expr_or    = binary_op('expr_and',   (| '|' |))
expr_and   = binary_op('expr_ar3',   (| '&' |))
expr_ar3   = binary_op('expr_prefixed', (| '**' | '<<' | '>>' |))

prefix_op(exp,oper) = apply(oper):op apply(exp):e -> call(op,e)

expr_prefixed = "+" expr_prefixed
              | prefix_op('expr_prefixed', (| '-' |))
              | expr_postfixed
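
The layering of binary_op rules can be mimicked in plain Ruby. The following is a minimal sketch, not the generated Peridot parser: it assumes integer literals only and a separate tokenizer, and the helper names tokenize, parse_level, parse_atom and evaluate are ours. Each level parses an operand on the next tighter level and then folds the operators of its own level, so << binds tighter than + and & binds tighter than ==.

```ruby
# Operator levels from loosest to tightest, following the rule chain of the grammar.
LEVELS = [
  %w[||], %w[&&], %w[< <= <=> >= > == !=],
  %w[+ -], %w[* / %], %w[|], %w[&], %w[** << >>]
].freeze

# Assumption of this sketch: a separate tokenizer that prefers the longest operator.
TOKEN = %r{\d+|<=>|\*\*|<<|>>|<=|>=|==|!=|\|\||&&|[-+*/%|&<>()]}

def tokenize(source)
  source.scan(TOKEN)
end

# Analogue of binary_op(exp, oper): parse an operand on the next tighter
# level, then fold (operator operand)* into a left-associative tree.
def parse_level(tokens, level = 0)
  return parse_atom(tokens) if level == LEVELS.size

  left = parse_level(tokens, level + 1)
  while LEVELS[level].include?(tokens.first)
    op = tokens.shift
    left = [op, left, parse_level(tokens, level + 1)]
  end
  left
end

def parse_atom(tokens)
  token = tokens.shift
  return Integer(token) unless token == '('

  inner = parse_level(tokens)
  tokens.shift # drop the closing parenthesis
  inner
end

# Evaluate the tree with Ruby's own integer operators.
def evaluate(node)
  return node unless node.is_a?(Array)

  op, left, right = node
  a = evaluate(left)
  b = evaluate(right)
  case op
  when '||' then a || b
  when '&&' then a && b
  else a.public_send(op, b)
  end
end

shift_tree = parse_level(tokenize('1 + 1<<2 == 5'))
mask_tree  = parse_level(tokenize('1&2 == 5&2'))
```

With this grouping both expressions mentioned at the beginning of this section evaluate to true.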
4.3 Functional programming style



Peridot supports several functional programming features.
Lambdas are supported with the same syntax as in amethyst.

Ruby uses the block passing style extensively, even for loops:

4.times{ |i|
  puts i
}

#equivalent code
4.times(&proc{ |i| puts i })

This construction can be viewed as a case of the continuation passing style [55].

We want to have better support for the continuation passing style.

In Peridot the yield keyword returns a (result, continuation) pair. The block syntax
a(b){block} is a shortcut for:

cont = a(b)
while (cont.is_a?(Continuation))
  r = block.call(cont.result)
  cont = cont.call(r)
end
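
The loop above can be tried out in Ruby by emulating the (result, continuation) pair with a Fiber. This is a sketch under our own assumptions and says nothing about how Peridot implements yield: the names FiberContinuation, start and run_with_block are hypothetical, and the callable passed to the body plays the role of the yield keyword.

```ruby
# Hypothetical stand-in for Peridot's Continuation, emulated with a Ruby Fiber.
class FiberContinuation
  attr_reader :result

  def initialize(fiber, result)
    @fiber = fiber
    @result = result
  end

  # Resume the suspended method and hand it the value computed by the block.
  def call(value)
    @fiber.resume(value)
  end
end

# Run `body` until its first yield. `body` receives a callable that plays the
# role of Peridot's yield keyword (an assumption of this sketch).
def start(&body)
  fiber = Fiber.new do
    body.call(->(value) { Fiber.yield(FiberContinuation.new(fiber, value)) })
  end
  fiber.resume
end

# The expansion of a(b){block} from the text.
def run_with_block(cont, &block)
  while cont.is_a?(FiberContinuation)
    r = block.call(cont.result)
    cont = cont.call(r)
  end
  cont
end

seen = []
first = start do |peridot_yield|
  a = peridot_yield.call(1) # receives the block's answer for 1
  b = peridot_yield.call(2) # receives the block's answer for 2
  a + b
end
total = run_with_block(first) { |x| seen << x; x * 10 }
```

The loop ends as soon as the resumed method returns an ordinary value instead of another continuation, and that value becomes the result of the whole call.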

When no block is specified, the returned Continuation object can be used in a similar
way to an Enumerator object in Ruby. The main difference is that with a Continuation
object communication goes in both directions.

As a hypothetical example, consider a binary search tree that implements a generic
binary search method bsearch, which takes a block supplying the comparison.
For example, a guess-the-number game can be written as:

t=Tree.new
t.bsearch{ |x|
  puts "is"+x+"more/less/equal to your number?"
  gets
}

We can change the numbers passed to the block with the map method:

t=Tree.new
c=t.bsearch
c=c.map{ |x| roman_numeral(x) }
c.call{ |x|
  puts "is"+x+"more/less/equal to your number?"
  gets
}

The values returned back can be changed as well. For example, reversing the direction
can be done by changing the third line of the previous example to:

c=c.rev_map{ |x|
  if x=="more"
    "less"
  else
    if x=="less"
      "more"
    else
      "equal"
    end
  end
}
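
The two directions of communication can be modelled in Ruby with plain closures. The following is a minimal sketch and not Peridot code: StepContinuation, bsearch_steps and drive are hypothetical names, and the search runs over a range of integers instead of a tree.

```ruby
# Hypothetical closure-based continuation: `result` goes out to the block,
# `call(answer)` feeds the answer back in and returns the next step.
class StepContinuation
  attr_reader :result

  def initialize(result, &resume)
    @result = result
    @resume = resume
  end

  def call(answer)
    @resume.call(answer)
  end

  # Transform the values passed to the block.
  def map(&f)
    StepContinuation.new(f.call(@result)) do |answer|
      rest = call(answer)
      rest.is_a?(StepContinuation) ? rest.map(&f) : rest
    end
  end

  # Transform the values returned by the block.
  def rev_map(&f)
    StepContinuation.new(@result) do |answer|
      rest = call(f.call(answer))
      rest.is_a?(StepContinuation) ? rest.rev_map(&f) : rest
    end
  end
end

# Binary search over lo..hi as a chain of steps (a stand-in for Tree#bsearch).
def bsearch_steps(lo, hi)
  return nil if lo > hi

  mid = (lo + hi) / 2
  StepContinuation.new(mid) do |answer|
    case answer
    when 'equal' then mid
    when 'more'  then bsearch_steps(lo, mid - 1) # mid is more than the number
    when 'less'  then bsearch_steps(mid + 1, hi) # mid is less than the number
    end
  end
end

# Feed every result to the block and every answer back to the search.
def drive(cont)
  cont = cont.call(yield(cont.result)) while cont.is_a?(StepContinuation)
  cont
end

secret = 7
flip = { 'more' => 'less', 'less' => 'more', 'equal' => 'equal' }
labels = []
found = drive(bsearch_steps(1, 10).map { |x| "<#{x}>" }) do |label|
  labels << label
  x = label[1..-2].to_i
  x > secret ? 'more' : (x < secret ? 'less' : 'equal')
end
reversed = drive(bsearch_steps(1, 10).rev_map { |a| flip[a] }) do |x|
  secret > x ? 'more' : (secret < x ? 'less' : 'equal')
end
```

Here map only rewrites what the block sees, while rev_map only rewrites what the search receives, so the second player can answer whether the number is more or less than the guess.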
A. Amethyst syntax summary

In this section we recapitulate the semantics of amethyst in a systematic manner.



Rules 



Pattern                 Description

rule = exp              Rule definition.
rule                    Rule call.
rule(v1,v2...) = exp    Parametrized definition.
rule(c1,c2...)          Parametrized call.
Grammar::rule           Call rule from given Grammar.
(| e |)                 Lambda



Semantic actions and predicates



Pattern         Shortcut    Description

{c}             core        Semantic action.
&{c}            core        Semantic predicate.
~{c}            &{!c}       Negative semantic predicate.
-> c newline    {c}         Alternative syntax of semantic action.



In semantic actions and predicates we perform the following substitutions.



Pattern     Shortcut             Description

@method     src.method           Method call (parameters are allowed)
@Class      Class.create(...)    Construct object
@>name      arguments["name"]    Contextual argument
@<name      returns["name"]      Contextual return
(|e|)       core                 Lambda



Variable binding 



Pattern    Shortcut       Description

e:v        core           Variable binding
e:{c}      e:it {c}       For conversions etc.
e:[c]      e:{c << it}    Append result to array c.
Sequencing and choice



Operation    Expansion              Description

e?           e | [illegible]        Make e optional.
Cut          auxiliary              Like ! in Prolog.
~e           e Cut fails | {nil}    Negative lookahead.
e1 & e2      ~~e1 e2                Positive lookahead.



Iteration 



Pattern    Shortcut        Description

e**        auxiliary       repeat-until
Stop       auxiliary       Stop iteration
e*         e**             When e contains Stop,
e*         (e | Stop)**    otherwise.
break      Cut Stop        Possible expansion.



Object-oriented constructs



Pattern    Shortcut           Description

e1[e2]     core               Enter operator
[e]        .[e]               We can omit the leading dot.
e1=>e2     e1:a {[a]}[e2]     Pass operator
Class      member(Class)      Test class membership
B. Standard prologue



amethyst Amethyst < AmethystCore {
  space    = <\s\t\r\n\f>
  spaces   = space*
  token(s) = spaces seq(s)
           = space

  lower  = <a-z>
  upper  = <A-Z>
  alpha  = lower | upper
  alnum  = alpha | digit
  digit  = <0-9>
  xdigit = <0-9a-fA-F>
  word   = alpha | '_'

  newline = '\r\n' | '\r' | '\n'
  line    = (newline break | .)*:{it*""}

  empty = -> nil
  eof   = ~ .

  seq(s) = _seq(s) {s}

  int = ('-'|{""}):s ( '0x' <0-9a-fA-F>+ | '0b' <01>+
                     | '0o' <0-7>+ | <0-9>+)[]:n {(s+n*"").to_i}
  number = int

  find(exp)    = (apply(exp):e break | .)* .* -> e
  replace(exp) = (apply(exp) | .)*:{it*""}

  until(e) = ( seq(e) break
             | ('\\':[x])? .:[x]
             )* -> x.join

  listOf(rule,del) = apply(rule):[f] (seq(del) apply(rule))*:[*f]
                   | empty -> []

  reverse(l) = {@@rev ||= Hash.new{ |h,k| h[k]=k.reverse }}
               _reverse(@@rev[@self])
               apply(l):rev
               _reverse(@@rev[@self])
               {rev}

  fails = &{false}

  char = .:c &{c.is_a? String} -> c

  member(x) = .:a &{x === a} {a}
  true  = member(true)
  false = member(false)
  nil   = member(nil)
  clas(cls)     = member(cls)
  range_in(a,b) = member(a..b)
  range_ex(a,b) = member(a...b)
  regch(regex)  = member(regex)

  parse(rule,obj,a) = {obj}[{apply(rule,*a)}:r {self.prof_report}]
  nested(start,mid,end) = seq(start) apply(mid) seq(end)
}
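
Several rules of the prologue (true, false, nil, clas, range_in, range_ex, regch) are defined through member, which relies on Ruby's case equality operator ===. A minimal Ruby sketch of this single test follows; the free-standing member function and the NO_MATCH marker are hypothetical and only stand in for a rule that accepts or rejects one input item.

```ruby
# Hypothetical failure marker; needed because nil and false are valid items.
NO_MATCH = Object.new.freeze

# Sketch of member(x): accept the item when x === item, otherwise fail.
def member(pattern, item)
  pattern === item ? item : NO_MATCH
end

results = {
  clas:     member(Integer, 42),   # Class#=== tests class membership
  range_in: member(1..5, 5),       # inclusive range contains 5
  range_ex: member(1...5, 5),      # exclusive range does not contain 5
  regch:    member(/[a-f]/, 'c'),  # Regexp#=== tests for a match
  nil_rule: member(nil, nil)       # plain objects compare by equality
}
```

One definition therefore covers classes, ranges, regular expressions and literal values, because each of these objects supplies its own ===.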
C. Peridot grammar



class Peridot_parser < Amethyst
  def call(name, *args)
    Call[{:name=>leterize(name), :ary=>args}]
  end
end

amethyst Peridot_parser {
  root = (body | defi | sequence)*:a .* {a}

  body = "class" "" name:name defi*:ary "end" @Klass

  defarg = <^,)>+:{it.join}

  defi = "def" "" defname:name {["obj self"]}:args
         '(' listOf('defarg',','):[args] ')'
         sequence:[ary] "end" @Def
  name = <a-zA-Z_>:s <a-zA-Z0-9_>*:{s+it*""}
  defname = (<^ \t\r\n()>)*:x {leterize(x*"")}

  atom = ""
         ( number:n -> CCode["Int(#{n})"]
         | '"' until('"'):s -> CCode["Str(#{s.inspect})"]
         | '(|' expr:e "|)" -> Lambda[e]
         | '(' expr:e ")" {e}
         | 'if' "(" expr:expr ")" block:block
             -> If[{:expr=>expr, :block=>block}]
         | ('}' break | &'{' atom:{'{'+it[0]+'}'}:[s] | .:[s])*
             -> CCode[s*""]
         | 'yield' atom:a -> Yield[a]
         | method("self")
         | local
         )

  local = ~'end' name:name @Var
  args = listOf('expr',',')

  method(obj) = name:name '(' (args | {[]}):arg ")"
                ( block:b -> Iterate[call(name,obj,*arg),b]
                | -> call(name,obj,*arg)
                )

  expr_postfixed = atom:a (
                     ( '[' args:arg "]" "=" expr:arg2 -> call("[]=",a,*arg,arg2)
                     | '[' args:arg "]" -> call("[]",a,*arg)
                     | '.' method(a)
                     | '.' name:name -> call(name,a)
                     ):a
                   )* {a}

  expr = expr_ass

  binary_op(exp,oper) = apply(exp):a
        ("" apply(oper):op apply(exp):b {call(op,a,b)}:a)* -> a

  expr_ass = "" name:name '=' expr:expr @Assign
           | expr_or_l
  expr_or_l  = binary_op('expr_and_l', (| "||" |))
  expr_and_l = binary_op('expr_cmp', (| "&&" |))

  expr_cmp = binary_op('expr_ar1', (| "<" | "<=" | "<=>" |
                                      ">=" | ">" | "==" | "!=" |))

  expr_ar1 = expr_ar2:a (("+" expr_ar2
                         |"-" expr_ar2:{call('-',it)}):b
                         {call('+',a,b)}:a )* -> a

  expr_ar2 = binary_op('expr_or', (| '*' | '/' | '%' |))
  expr_or  = binary_op('expr_and', (| '|' |))
  expr_and = binary_op('expr_ar3', (| '&' |))
  expr_ar3 = binary_op('expr_pfx', (| '**' | '<<' | '>>' |))

  prefix_op(oper,exp) = apply(oper):op apply(exp):e -> call(op,e)

  expr_pfx = "+" expr_pfx
           | prefix_op((| '-' |), 'expr_pfx')
           | prefix_op((| <!~> |), 'expr_pfx')
           | expr_postfixed

  block = "{" sequence:s '}' {s}

  sequence = expr:[ary] ( newline expr:[ary])* @Seq
}